Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification
Pith reviewed 2026-05-25 10:22 UTC · model grok-4.3
The pith
Sub-band CNN applies different kernels to acoustic feature sub-bands to cut computation by up to 49% while maintaining accuracy on spoken term classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The sub-band CNN architecture applies a different set of convolutional kernels to each feature sub-band, which lowers the overall computation for spoken term classification. Experimental results on the Speech Commands dataset show a 39.7% reduction in computation on commands classification and a 49.3% reduction on digits classification relative to a full-band CNN baseline, with accuracy maintained.
What carries the argument
Sub-band CNN architecture applying different convolutional kernels on each feature sub-band to improve computational efficiency for acoustic feature maps.
Load-bearing premise
That applying different convolutional kernels on each feature sub-band preserves classification accuracy while lowering compute cost.
What would settle it
Running the sub-band CNN and full-band CNN on the Speech Commands dataset and finding that the sub-band version either requires more computation or achieves lower accuracy.
Original abstract
This paper proposes a Sub-band Convolutional Neural Network for spoken term classification. Convolutional neural networks (CNNs) have proven to be very effective in acoustic applications such as spoken term classification, keyword spotting, speaker identification, acoustic event detection, etc. Unlike applications in computer vision, the spatial invariance property of 2D convolutional kernels does not fit acoustic applications well since the meaning of a specific 2D kernel varies a lot along the feature axis in an input feature map. We propose a sub-band CNN architecture to apply different convolutional kernels on each feature sub-band, which makes the overall computation more efficient. Experimental results show that the computational efficiency brought by sub-band CNN is more beneficial for small-footprint models. Compared to a baseline full band CNN for spoken term classification on a publicly available Speech Commands dataset, the proposed sub-band CNN architecture reduces the computation by 39.7% on commands classification, and 49.3% on digits classification with accuracy maintained.
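As a rough illustration of the idea described in the abstract, the following Python sketch splits a time-by-feature map into sub-bands along the feature axis and applies a separate 2D kernel to each band. The band edges, kernel values, helper names (`conv2d_valid`, `subband_conv`), and the use of non-overlapping bands are all assumptions made for this example; the paper's actual layer configuration is not stated in the text above.

```python
# Hypothetical illustration only: band edges and kernels are made up,
# not taken from the paper.

def conv2d_valid(x, k):
    """'Valid' 2D cross-correlation of x (rows x cols) with kernel k."""
    kh, kw = len(k), len(k[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def subband_conv(feature_map, band_edges, kernels):
    """Apply a separate kernel to each feature sub-band.

    feature_map: list of time frames, each a list of feature values.
    band_edges:  list of (start, stop) indices along the feature axis.
    kernels:     one 2D kernel (time x feature) per band.
    Returns one output map per band.
    """
    if len(band_edges) != len(kernels):
        raise ValueError("need exactly one kernel per band")
    outputs = []
    for (start, stop), kernel in zip(band_edges, kernels):
        # Slice the feature axis so the kernel only ever sees its own band.
        band = [frame[start:stop] for frame in feature_map]
        outputs.append(conv2d_valid(band, kernel))
    return outputs


# Toy input: 4 time frames x 6 feature bins.
features = [[float(t * 6 + f) for f in range(6)] for t in range(4)]
bands = [(0, 3), (3, 6)]                 # two non-overlapping sub-bands
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],            # kernel used only on the low band
    [[0.5, 0.5], [0.5, 0.5]],            # kernel used only on the high band
]
low, high = subband_conv(features, bands, kernels)
```

In a full-band CNN the same kernel would slide across all six feature bins; here each kernel is confined to its own band, which is the property the abstract argues is a better fit for acoustic feature maps.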
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a sub-band CNN architecture for spoken term classification on the Speech Commands dataset. It motivates the approach by noting that spatial invariance of standard 2D convolutional kernels fits acoustic feature maps poorly, and claims that applying different kernels per sub-band reduces computation (39.7% for commands classification, 49.3% for digits classification) while maintaining accuracy relative to a full-band baseline CNN, with particular benefits for small-footprint models.
Significance. If the empirical results hold under proper validation, the work could provide a useful efficiency technique for resource-constrained acoustic applications such as keyword spotting on embedded devices. The use of a public dataset allows direct comparison, and the focus on compute reduction without accuracy loss addresses a practical constraint in the field.
major comments (2)
- [Abstract] Abstract and experimental results: the central claim that accuracy is 'maintained' at the stated compute reductions is reported without error bars, dataset splits, training protocol details, number of runs, or statistical tests. Without these, the accuracy-maintenance assertion cannot be verified from the text, and the efficiency claim depends on it.
- [Results] Results section: without reported variance across runs or baseline implementation specifics (e.g., exact kernel sizes, layer counts, or FLOPs calculation method), it is impossible to confirm the 39.7% and 49.3% reductions or to rule out that accuracy differences fall within experimental noise.
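The FLOPs-accounting concern in the major comments can be made concrete with a small sketch. It counts multiply-accumulate operations (MACs) for a 'valid' 2D convolution layer and computes a relative reduction between two models. This is one common counting convention, not necessarily the one the authors used, and every layer size below is a toy value chosen for the example rather than a figure from the paper.

```python
def conv_macs(in_h, in_w, k_h, k_w, in_ch, out_ch, stride_h=1, stride_w=1):
    """Multiply-accumulates for one 'valid' 2D convolution layer."""
    out_h = (in_h - k_h) // stride_h + 1
    out_w = (in_w - k_w) // stride_w + 1
    return out_h * out_w * k_h * k_w * in_ch * out_ch


def relative_reduction(baseline, proposed):
    """Fraction of the baseline compute saved by the proposed model."""
    return 1.0 - proposed / baseline


# Toy full-band layer: 10 frames x 8 feature bins, 3x3 kernel, 1 -> 4 channels.
full_band = conv_macs(10, 8, 3, 3, in_ch=1, out_ch=4)

# Toy sub-band layer: two bands of 4 bins each, 3x3 kernel, 1 -> 2 channels per band.
sub_band = 2 * conv_macs(10, 4, 3, 3, in_ch=1, out_ch=2)

saving = relative_reduction(full_band, sub_band)
```

Stating the counting rule this explicitly (MACs versus FLOPs, whether pooling and dense layers are included, how strides are handled) is the kind of detail that would let a reader reproduce the 39.7% and 49.3% figures.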
minor comments (1)
- [Introduction] The motivation paragraph on spatial invariance could benefit from a short concrete illustration (e.g., how a kernel's meaning changes across frequency bands) to strengthen the premise before the empirical test.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The concerns regarding experimental reproducibility and statistical rigor are valid, and we will revise the manuscript to address them by adding the requested details on variance, protocol, and implementation. This will strengthen the presentation of our efficiency claims without altering the core technical contribution.
-
Referee: [Abstract] Abstract and experimental results: the central claim that accuracy is 'maintained' at the stated compute reductions is reported without error bars, dataset splits, training protocol details, number of runs, or statistical tests. Without these, the accuracy-maintenance assertion cannot be verified from the text, and the efficiency claim depends on it.
Authors: We agree that the absence of these details limits verifiability. In the revised manuscript we will report mean accuracy and standard deviation over multiple independent training runs (with different random seeds), specify the exact train/validation/test splits from the Speech Commands dataset, describe the full training protocol (optimizer, learning rate, batch size, epochs, and regularization), state the number of runs performed, and include any statistical significance tests used to support the claim that accuracy is maintained within experimental variation.
Revision: yes
-
Referee: [Results] Results section: without reported variance across runs or baseline implementation specifics (e.g., exact kernel sizes, layer counts, or FLOPs calculation method), it is impossible to confirm the 39.7% and 49.3% reductions or to rule out that accuracy differences fall within experimental noise.
Authors: We concur that implementation specifics and variance measures are required for independent verification. The revision will explicitly list the architecture details (kernel sizes, number of layers, channel counts) for both the full-band baseline and sub-band models, describe the precise method used to compute FLOPs, and report accuracy and compute results with error bars across the same set of multiple runs. This will allow direct assessment of whether observed differences lie within noise.
Revision: yes
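The reporting promised in the responses above (mean and standard deviation over several seeds, plus a check on whether a gap lies within noise) could look roughly like the sketch below. The accuracy values are placeholders invented for the example and are not results from the paper; `within_noise` is a hypothetical helper and a crude heuristic, not a substitute for a proper significance test.

```python
import statistics


def summarize(accuracies):
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def within_noise(acc_a, acc_b, k=2.0):
    """Crude check: is the gap in means at most k pooled standard deviations?"""
    mean_a, sd_a = summarize(acc_a)
    mean_b, sd_b = summarize(acc_b)
    pooled = ((sd_a ** 2 + sd_b ** 2) / 2) ** 0.5
    return abs(mean_a - mean_b) <= k * pooled


# Placeholder accuracies for five seeds each (NOT taken from the paper).
full_band_runs = [0.90, 0.91, 0.89, 0.90, 0.90]
sub_band_runs = [0.90, 0.89, 0.91, 0.90, 0.89]

full_mean, full_sd = summarize(full_band_runs)
sub_mean, sub_sd = summarize(sub_band_runs)
maintained = within_noise(full_band_runs, sub_band_runs)
```

With per-seed numbers like these reported for both models, a reader could judge the "accuracy maintained" claim directly, or apply a formal test of their choice.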
Circularity Check
No significant circularity
The paper contains no derivation chain, equations, or fitted parameters. Its central claim is an empirical result: on the Speech Commands dataset the sub-band CNN achieves stated compute reductions (39.7% commands, 49.3% digits) while accuracy is maintained relative to a full-band baseline. The premise about 2-D kernel spatial invariance is offered only as motivation; the experiment itself tests the efficiency-accuracy trade-off. No self-citation, ansatz, or uniqueness theorem is load-bearing. The result is self-contained against external benchmarks.
Reference graph
Works this paper leans on
-
[1]
Introduction. With the rapid development of public available datasets (e.g. spoken term classification [1], speaker identification [2, 3], acoustic event classification/detection [4, 5], etc.), state-of-the-art models for various acoustic applications can be trained with a large amount of annotated data. CNN-based architectures have achieved state-of-the-art...
-
[2]
Sub-band CNN. We show implementation details of the proposed sub-band CNN in this section. We chose the “cnn-trad-fpool3” model proposed in [6] as our baseline model. We used the implementation of “cnn-trad-fpool3” in Tensorflow official package [27] as the baseline, which is slightly different from the original model described in [6]. As shown in Fig. 1a...
-
[3]
Experimental Results. 3.1. Datasets. We tested the proposed model on Google Speech Commands dataset [1], which has 35 words in the latest version (v0.02). We chose two subsets as our testbed for spoken term classification tasks. For the formulation of subsets, we use the same setup as the “Audio Recognition” tutorial in the official Tensorflow package [28]...
-
[4]
Conclusions. In this paper, we proposed a sub-band CNN architecture and explored it for spoken term classification. We compare the proposed sub-band CNNs to full band CNNs and another weight sharing approach on two spoken term classification tasks. The proposed architecture of sub-band CNNs reduces the computation by 39.7% on commands classification, and ...
-
[5]
Acknowledgement. The authors would like to thank Weiran Wang, Krishna Puvvada, and Wei-Ning Hsu for useful discussions.
-
[6]
Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition
P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” CoRR, vol. abs/1804.03209, 2018
-
[7]
VoxCeleb: a large-scale speaker identification dataset
A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017, pp. 2616–2620
-
[8]
VoxCeleb2: Deep speaker recognition
J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in INTERSPEECH, 2018, pp. 1086–1090
-
[9]
Audio set: An ontology and human-labeled dataset for audio events
J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE ICASSP, 2017, pp. 776–780
-
[10]
DCASE 2017 challenge setup: Tasks, datasets and baseline system
A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017, pp. 85–92
-
[11]
Convolutional neural networks for small-footprint keyword spotting
T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in INTERSPEECH, 2015, pp. 1478–1482
-
[12]
Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in IEEE ICASSP, 2012, pp. 4277–4280
-
[13]
Very deep convolutional neural networks for noise robust speech recognition
Y. Qian, M. Bi, T. Tan, and K. Yu, “Very deep convolutional neural networks for noise robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2263–2276, Dec 2016
-
[14]
Deep convolutional neural networks and data augmentation for acoustic event recognition
N. Takahashi, M. Gygli, B. Pfister, and L. V. Gool, “Deep convolutional neural networks and data augmentation for acoustic event recognition,” in INTERSPEECH, 2016, pp. 2982–2986
-
[15]
CNN architectures for large-scale audio classification
S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, “CNN architectures for large-scale audio classification,” in IEEE ICASSP, 2017, pp. 131–135
-
[16]
Semi-supervised acoustic event detection based on tri-training
B. Shi, M. Sun, C. Kao, V. Rozgic, S. Matsoukas, and C. Wang, “Semi-supervised acoustic event detection based on tri-training,” in IEEE ICASSP, 2019, pp. 750–754
-
[17]
Hierarchical residual-pyramidal model for large context based media presence detection
Q. Tang, M. Sun, C. Kao, V. Rozgic, and C. Wang, “Hierarchical residual-pyramidal model for large context based media presence detection,” in IEEE ICASSP, 2019, pp. 3312–3316
-
[18]
Rare sound event detection using 1D convolutional recurrent neural networks
H. Lim, J. Park, and Y. Han, “Rare sound event detection using 1D convolutional recurrent neural networks,” DCASE2017 Challenge, Tech. Rep., September 2017
-
[19]
R-CRNN: region-based convolutional recurrent neural network for audio event detection
C. Kao, W. Wang, M. Sun, and C. Wang, “R-CRNN: region-based convolutional recurrent neural network for audio event detection,” in INTERSPEECH, 2018, pp. 1358–1362
-
[20]
Multi-scale multi-band densenets for audio source separation
N. Takahashi and Y. Mitsufuji, “Multi-scale multi-band densenets for audio source separation,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 21–25
-
[21]
Small-footprint keyword spotting using deep neural networks
G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in IEEE ICASSP, 2014, pp. 4087–4091
-
[22]
Streaming small-footprint keyword spotting using sequence-to-sequence models
Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, “Streaming small-footprint keyword spotting using sequence-to-sequence models,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 474–481
-
[23]
Convolutional recurrent neural networks for small-footprint keyword spotting
S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” in INTERSPEECH, 2017, pp. 1606–1610
-
[24]
Deep residual learning for small-footprint keyword spotting
R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” in IEEE ICASSP, 2018, pp. 5484–5488
-
[25]
Model compression applied to small-footprint keyword spotting
G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni, “Model compression applied to small-footprint keyword spotting,” in INTERSPEECH, 2016, pp. 1878–1882
-
[26]
Compressed time delay neural network for small-footprint keyword spotting
M. Sun, D. Snyder, Y. Gao, V. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, “Compressed time delay neural network for small-footprint keyword spotting,” in INTERSPEECH, 2017, pp. 3607–3611
-
[27]
Knowledge distillation for small-footprint highway networks
L. Lu, M. Guo, and S. Renals, “Knowledge distillation for small-footprint highway networks,” in IEEE ICASSP, 2017, pp. 4820–4824
-
[28]
Compression of end-to-end models
R. Pang, T. Sainath, R. Prabhavalkar, S. Gupta, Y. Wu, S. Zhang, and C.-C. Chiu, “Compression of end-to-end models,” in INTERSPEECH, 2018, pp. 27–31
-
[29]
Compression of acoustic event detection models with quantized distillation
B. Shi, M. Sun, C. Kao, V. Rozgic, S. Matsoukas, and C. Wang, “Compression of acoustic event detection models with quantized distillation,” to appear in INTERSPEECH, 2019
-
[30]
O. Abdel-Hamid, L. Deng, and D. Yu, “Exploring convolutional neural network structures and optimization techniques for speech recognition,” in INTERSPEECH, 2013, pp. 3366–3370
-
[31]
S. S. R. Phaye, E. Benetos, and Y. Wang, “Subspectralnet - using sub-spectrogram based convolutional neural networks for acoustic scene classification,” in IEEE ICASSP, 2019, pp. 825–829
-
[32]
TensorFlow: Large-scale machine learning on heterogeneous systems
M. Abadi, A. Agarwal et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
-
[33]
TensorFlow: Simple audio recognition
“TensorFlow: Simple audio recognition.” [Online]. Available: https://www.tensorflow.org/tutorials/sequences/audio_recognition/