Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification

Chao Wang; Chieh-Chi Kao; Ming Sun; Shiv Vitaladevuni; Yixin Gao

arxiv: 1907.01448 · v1 · pith:VHKE6BIOnew · submitted 2019-07-02 · 📡 eess.AS · cs.SD

Chieh-Chi Kao, Ming Sun, Yixin Gao, Shiv Vitaladevuni, Chao Wang

Pith reviewed 2026-05-25 10:22 UTC · model grok-4.3

classification 📡 eess.AS cs.SD

keywords: sub-band CNN, spoken term classification, convolutional neural networks, computational efficiency, Speech Commands dataset, small-footprint models, acoustic feature maps

The pith

Sub-band CNN applies different kernels to acoustic feature sub-bands to cut computation by up to 49% while maintaining accuracy on spoken term classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a Sub-band Convolutional Neural Network for spoken term classification. Unlike standard CNNs, it applies different convolutional kernels on each feature sub-band because the spatial invariance of 2D kernels does not fit acoustic applications well. On the publicly available Speech Commands dataset, the sub-band CNN reduces computation by 39.7% for commands classification and 49.3% for digits classification compared to a baseline full-band CNN, with accuracy maintained. The efficiency gain is especially useful for small-footprint models in acoustic tasks.

Core claim

The sub-band CNN architecture applies different convolutional kernels on each feature sub-band to make overall computation more efficient for spoken term classification. Experimental results on the Speech Commands dataset show that this reduces computation by 39.7% on commands classification and 49.3% on digits classification while maintaining accuracy compared to a full band CNN baseline.

What carries the argument

Sub-band CNN architecture applying different convolutional kernels on each feature sub-band to improve computational efficiency for acoustic feature maps.
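
As a rough illustration of what "different kernels on each feature sub-band" could look like, the following pure-Python sketch slices the feature axis into overlapping bands, convolves each band with its own kernel, and concatenates the results. The band width, hop, kernel values, and the function names `split_sub_bands` and `sub_band_conv` are assumptions made for this example; the abstract does not state the paper's sub-band boundaries or layer shapes.

```python
from typing import List

Matrix = List[List[float]]  # rows = time frames, columns = feature bins


def conv2d_valid(x: Matrix, k: Matrix) -> Matrix:
    """Plain 'valid' 2D cross-correlation of x with kernel k, stride 1."""
    kt, kf = len(k), len(k[0])
    out_t = len(x) - kt + 1
    out_f = len(x[0]) - kf + 1
    return [
        [
            sum(x[t + i][f + j] * k[i][j] for i in range(kt) for j in range(kf))
            for f in range(out_f)
        ]
        for t in range(out_t)
    ]


def split_sub_bands(x: Matrix, width: int, hop: int) -> List[Matrix]:
    """Slice the feature axis into bands of `width` bins every `hop` bins.

    hop < width gives overlapping bands; hop == width gives disjoint ones.
    """
    n_feat = len(x[0])
    starts = range(0, n_feat - width + 1, hop)
    return [[row[s:s + width] for row in x] for s in starts]


def sub_band_conv(x: Matrix, kernels: List[Matrix], width: int, hop: int) -> Matrix:
    """Apply kernels[b] to band b only, then concatenate along the feature axis."""
    bands = split_sub_bands(x, width, hop)
    if len(bands) != len(kernels):
        raise ValueError("need exactly one kernel per sub-band")
    outputs = [conv2d_valid(band, k) for band, k in zip(bands, kernels)]
    # All kernels share the same height here, so every band yields the same
    # number of time steps and the rows can be joined one by one.
    return [sum((out[t] for out in outputs), []) for t in range(len(outputs[0]))]


# Toy input (hypothetical): 4 time frames x 8 feature bins, value = bin index.
features = [[float(f) for f in range(8)] for _ in range(4)]
# Three overlapping bands (width 4, hop 2), each with its own 2x2 kernel.
kernels = [
    [[1.0, 0.0], [0.0, 0.0]],  # band 0: picks the top-left value
    [[0.0, 1.0], [0.0, 0.0]],  # band 1: picks the top-right value
    [[0.5, 0.5], [0.0, 0.0]],  # band 2: averages the top row
]
result = sub_band_conv(features, kernels, width=4, hop=2)
print(len(result), len(result[0]))
```

A full-band CNN would instead slide one shared kernel across all eight bins; in this sketch each band's weights are free to specialise to its own region of the feature axis.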

Load-bearing premise

That applying different convolutional kernels on each feature sub-band preserves classification accuracy while lowering compute cost.

What would settle it

Running the sub-band CNN and full-band CNN on the Speech Commands dataset and finding that the sub-band version either requires more computation or achieves lower accuracy.

Figures

Figures reproduced from arXiv: 1907.01448 by Chao Wang, Chieh-Chi Kao, Ming Sun, Shiv Vitaladevuni, Yixin Gao.

**Figure 1.** CNN models with different weight sharing methods. (a) The baseline model proposed in [6]. (b) Applying the multi-band approach proposed in [15] to the baseline model. (c) The proposed overlapped sub-band CNN. For ease of illustration, the x-axis is set as feature in this figure, which is different from conventional settings.

**Figure 2.** Accuracy curve of different weight sharing methods on subsets of Google Speech Commands dataset [1]. Each data point represents an average of five trials, and the error bar is the sample standard deviation of five trials.

**Figure 3.** Accuracy curve of experiments on number of sub-bands and concatenation methods for sub-band features. Commands classification is used as the testbed. Each data point represents an average of five trials, and the error bar is the sample standard deviation of five trials.

Original abstract

This paper proposes a Sub-band Convolutional Neural Network for spoken term classification. Convolutional neural networks (CNNs) have proven to be very effective in acoustic applications such as spoken term classification, keyword spotting, speaker identification, acoustic event detection, etc. Unlike applications in computer vision, the spatial invariance property of 2D convolutional kernels does not fit acoustic applications well since the meaning of a specific 2D kernel varies a lot along the feature axis in an input feature map. We propose a sub-band CNN architecture to apply different convolutional kernels on each feature sub-band, which makes the overall computation more efficient. Experimental results show that the computational efficiency brought by sub-band CNN is more beneficial for small-footprint models. Compared to a baseline full band CNN for spoken term classification on a publicly available Speech Commands dataset, the proposed sub-band CNN architecture reduces the computation by 39.7% on commands classification, and 49.3% on digits classification with accuracy maintained.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Sub-band CNN is a practical tweak for efficiency in small audio classifiers, but the abstract leaves the gains unverified and the full paper needs checking for details.

The main takeaway is that this paper introduces a sub-band CNN for spoken term classification that splits frequency features and applies separate kernels per band, claiming 39.7% and 49.3% compute cuts on the Speech Commands dataset for commands and digits while keeping accuracy the same as a full-band baseline. That is the concrete new piece: a direct architectural response to the claim that 2D convolution invariance fits audio poorly. The motivation is straightforward and the focus on small-footprint models is relevant for edge use. Using a public dataset is also a plus, as it lets others reproduce the comparison. The paper does a clean job of tying the design choice to the efficiency goal without overclaiming broader impact.

The soft spots sit in the evidence. The abstract gives no error bars, no description of sub-band boundaries, no training protocol, and no statistical tests, so the central efficiency claim cannot be checked from the text alone. If the full version supplies ablations showing the split itself drives the savings rather than other factors like total parameters, that would strengthen it; otherwise the result stays hard to interpret. Minor issues include the usual risk that a single dataset and baseline pair overstates the advantage.

This is for people working on compact keyword spotting or audio classification on limited hardware. A reader who needs incremental CNN variants for speech might try the sub-band idea and see if it transfers. It is not a broad advance in audio modeling, but the proposal is specific enough to be worth testing.

I would send it to peer review because the dataset is public, the architecture is simple to implement, and a referee can ask for the missing controls and runs. The work is coherent on its own terms even if the numbers need more support.

Referee Report

2 major / 1 minor

Summary. The paper proposes a sub-band CNN architecture for spoken term classification on the Speech Commands dataset. It motivates the approach by noting that spatial invariance of standard 2D convolutional kernels fits acoustic feature maps poorly, and claims that applying different kernels per sub-band reduces computation (39.7% for commands classification, 49.3% for digits classification) while maintaining accuracy relative to a full-band baseline CNN, with particular benefits for small-footprint models.

Significance. If the empirical results hold under proper validation, the work could provide a useful efficiency technique for resource-constrained acoustic applications such as keyword spotting on embedded devices. The use of a public dataset allows direct comparison, and the focus on compute reduction without accuracy loss addresses a practical constraint in the field.

major comments (2)

[Abstract] Abstract and experimental results: the central claim that accuracy is 'maintained' while achieving the stated compute reductions supplies no error bars, dataset splits, training protocol details, number of runs, or statistical tests. This absence makes the accuracy-maintenance assertion unverifiable from the reported text and is load-bearing for the efficiency claim.
[Results] Results section: without reported variance across runs or baseline implementation specifics (e.g., exact kernel sizes, layer counts, or FLOPs calculation method), it is impossible to confirm the 39.7% and 49.3% reductions or to rule out that accuracy differences fall within experimental noise.
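
To make the "within experimental noise" concern concrete, here is a minimal sketch of how per-model accuracy could be summarised over repeated runs and compared with Welch's t statistic. The accuracy values are hypothetical placeholders, not numbers from the paper, and the helper names `summarize` and `welch_t` are chosen for this example only.

```python
import math
import statistics


def summarize(accuracies):
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    standard_error = math.sqrt(sd_a ** 2 / len(a) + sd_b ** 2 / len(b))
    return (mean_a - mean_b) / standard_error


# Hypothetical accuracies (%) from five seeds each; NOT results from the paper.
full_band = [90.0, 91.0, 92.0, 93.0, 94.0]
sub_band = [89.0, 91.0, 92.0, 93.0, 94.0]

for name, runs in (("full-band", full_band), ("sub-band", sub_band)):
    mean, sd = summarize(runs)
    print(f"{name}: {mean:.2f} +/- {sd:.2f} over {len(runs)} runs")
print(f"Welch t = {welch_t(full_band, sub_band):.3f}")
```

A t statistic close to zero relative to the run-to-run spread would support the claim that accuracy is maintained; a full analysis would also report degrees of freedom and a p-value.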

minor comments (1)

[Introduction] The motivation paragraph on spatial invariance could benefit from a short concrete illustration (e.g., how a kernel's meaning changes across frequency bands) to strengthen the premise before the empirical test.
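
One possible form for such an illustration, assuming mel-spaced filterbank features (the reviewed text does not say which features the paper uses): the sketch below computes how many hertz an 8-bin kernel covers at the bottom and at the top of a 40-bin feature axis. The bin count, the 20 Hz to 4000 Hz range, and the helper names are assumptions; the mel conversion is the common 2595 log10(1 + f/700) formula.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_bin_centres(n_bins: int, low_hz: float, high_hz: float):
    """Centre frequencies (Hz) of n_bins filters spaced evenly on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_bins + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_bins)]


def kernel_span_hz(centres, start_bin: int, kernel_bins: int) -> float:
    """Frequency range between the first and last bin centre under a kernel."""
    return centres[start_bin + kernel_bins - 1] - centres[start_bin]


# Hypothetical front end: 40 mel bins between 20 Hz and 4 kHz.
centres = mel_bin_centres(n_bins=40, low_hz=20.0, high_hz=4000.0)
low_span = kernel_span_hz(centres, start_bin=0, kernel_bins=8)
high_span = kernel_span_hz(centres, start_bin=32, kernel_bins=8)
print(f"8-bin kernel at bins 0-7:   {low_span:.0f} Hz wide")
print(f"8-bin kernel at bins 32-39: {high_span:.0f} Hz wide")
```

Under these assumptions the same kernel footprint covers a much wider frequency range near the top of the axis than near the bottom, which is one way to read the claim that a 2D kernel's meaning varies along the feature axis.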

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The concerns regarding experimental reproducibility and statistical rigor are valid, and we will revise the manuscript to address them by adding the requested details on variance, protocol, and implementation. This will strengthen the presentation of our efficiency claims without altering the core technical contribution.

Referee: [Abstract] Abstract and experimental results: the central claim that accuracy is 'maintained' while achieving the stated compute reductions supplies no error bars, dataset splits, training protocol details, number of runs, or statistical tests. This absence makes the accuracy-maintenance assertion unverifiable from the reported text and is load-bearing for the efficiency claim.

Authors: We agree that the absence of these details limits verifiability. In the revised manuscript we will report mean accuracy and standard deviation over multiple independent training runs (with different random seeds), specify the exact train/validation/test splits from the Speech Commands dataset, describe the full training protocol (optimizer, learning rate, batch size, epochs, and regularization), state the number of runs performed, and include any statistical significance tests used to support the claim that accuracy is maintained within experimental variation. revision: yes
Referee: [Results] Results section: without reported variance across runs or baseline implementation specifics (e.g., exact kernel sizes, layer counts, or FLOPs calculation method), it is impossible to confirm the 39.7% and 49.3% reductions or to rule out that accuracy differences fall within experimental noise.

Authors: We concur that implementation specifics and variance measures are required for independent verification. The revision will explicitly list the architecture details (kernel sizes, number of layers, channel counts) for both the full-band baseline and sub-band models, describe the precise method used to compute FLOPs, and report accuracy and compute results with error bars across the same set of multiple runs. This will allow direct assessment of whether observed differences lie within noise. revision: yes
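
A minimal sketch of one way a compute comparison of this kind could be specified, counting multiply-accumulates (MACs) for stride-1 "valid" 2D convolutions. All layer shapes below are hypothetical and are not the paper's baseline or sub-band configurations; the `ConvLayer` class and `relative_reduction` helper are names invented for this example. In this toy setup the saving comes from giving each band fewer filters than the full-band layer, which is an assumption rather than something the reviewed text states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    in_time: int
    in_feat: int
    in_channels: int
    kernel_time: int
    kernel_feat: int
    out_channels: int

    def macs(self) -> int:
        """Multiply-accumulates for a stride-1 'valid' 2D convolution."""
        out_time = self.in_time - self.kernel_time + 1
        out_feat = self.in_feat - self.kernel_feat + 1
        per_output = self.kernel_time * self.kernel_feat * self.in_channels
        return out_time * out_feat * self.out_channels * per_output


def relative_reduction(baseline_macs: int, candidate_macs: int) -> float:
    """Fraction of baseline compute removed by the candidate model."""
    return 1.0 - candidate_macs / baseline_macs


# Hypothetical full-band layer: 50 frames x 40 bins, one 10x8 kernel bank, 32 filters.
full = ConvLayer(in_time=50, in_feat=40, in_channels=1,
                 kernel_time=10, kernel_feat=8, out_channels=32)
# Hypothetical sub-band layer: three 16-bin bands, 8 filters each.
bands = [
    ConvLayer(in_time=50, in_feat=16, in_channels=1,
              kernel_time=10, kernel_feat=8, out_channels=8)
    for _ in range(3)
]

sub_band_macs = sum(b.macs() for b in bands)
reduction = relative_reduction(full.macs(), sub_band_macs)
print(f"full-band MACs: {full.macs()}")
print(f"sub-band MACs:  {sub_band_macs}")
print(f"reduction:      {reduction:.1%}")
```

Read the same way, the reported 39.7% and 49.3% reductions correspond to the sub-band model using 0.603 and 0.507 of the baseline's compute, respectively, under whatever counting method the paper applies.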

Circularity Check

0 steps flagged

No significant circularity

The paper contains no derivation chain, equations, or fitted parameters. Its central claim is an empirical result: on the Speech Commands dataset the sub-band CNN achieves stated compute reductions (39.7% commands, 49.3% digits) while accuracy is maintained relative to a full-band baseline. The premise about 2-D kernel spatial invariance is offered only as motivation; the experiment itself tests the efficiency-accuracy trade-off. No self-citation, ansatz, or uniqueness theorem is load-bearing. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach implicitly assumes standard CNN training and a frequency-axis split whose boundaries are not stated.

pith-pipeline@v0.9.0 · 5705 in / 1109 out tokens · 28465 ms · 2026-05-25T10:22:34.623129+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 2 internal anchors

[1] Introduction. With the rapid development of public available datasets (e.g. spoken term classification [1], speaker identification [2, 3], acoustic event classification/detection [4, 5], etc.), state-of-the-art models for various acoustic applications can be trained with a large amount of annotated data. CNN-based architectures have achieved state-of-the-art...

[2] Sub-band CNN. We show implementation details of the proposed sub-band CNN in this section. We chose the “cnn-trad-fpool3” model proposed in [6] as our baseline model. We used the implementation of “cnn-trad-fpool3” in Tensorflow official package [27] as the baseline, which is slightly different from the original model described in [6]. As shown in Fig. 1a...

[3] Experimental Results. 3.1. Datasets. We tested the proposed model on Google Speech Commands dataset [1], which has 35 words in the latest version (v0.02). We chose two subsets as our testbed for spoken term classification tasks. For the formulation of subsets, we use the same setup as the “Audio Recognition” tutorial in the official Tensorflow package [28]...

[4] Conclusions. In this paper, we proposed a sub-band CNN architecture and explored it for spoken term classification. We compare the proposed sub-band CNNs to full band CNNs and another weight sharing approach on two spoken term classification tasks. The proposed architecture of sub-band CNNs reduces the computation by 39.7% on commands classification, and ...

[5] Acknowledgement. The authors would like to thank Weiran Wang, Krishna Puvvada, and Wei-Ning Hsu for useful discussions.
[6] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” CoRR, vol. abs/1804.03209, 2018.

[7] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017, pp. 2616–2620.

[8] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in INTERSPEECH, 2018, pp. 1086–1090.

[9] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE ICASSP, 2017, pp. 776–780.

[10] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017, pp. 85–92.

[11] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in INTERSPEECH, 2015, pp. 1478–1482.

[12] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid nn-hmm model for speech recognition,” in IEEE ICASSP, 2012, pp. 4277–4280.

[13] Y. Qian, M. Bi, T. Tan, and K. Yu, “Very deep convolutional neural networks for noise robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2263–2276, Dec 2016.

[14] N. Takahashi, M. Gygli, B. Pfister, and L. V. Gool, “Deep convolutional neural networks and data augmentation for acoustic event recognition,” in INTERSPEECH, 2016, pp. 2982–2986.

[15] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, “CNN architectures for large-scale audio classification,” in IEEE ICASSP, 2017, pp. 131–135.

[16] B. Shi, M. Sun, C. Kao, V. Rozgic, S. Matsoukas, and C. Wang, “Semi-supervised acoustic event detection based on tri-training,” in IEEE ICASSP, 2019, pp. 750–754.

[17] Q. Tang, M. Sun, C. Kao, V. Rozgic, and C. Wang, “Hierarchical residual-pyramidal model for large context based media presence detection,” in IEEE ICASSP, 2019, pp. 3312–3316.

[18] H. Lim, J. Park, and Y. Han, “Rare sound event detection using 1D convolutional recurrent neural networks,” DCASE2017 Challenge, Tech. Rep., September 2017.

[19] C. Kao, W. Wang, M. Sun, and C. Wang, “R-CRNN: region-based convolutional recurrent neural network for audio event detection,” in INTERSPEECH, 2018, pp. 1358–1362.

[20] N. Takahashi and Y. Mitsufuji, “Multi-scale multi-band densenets for audio source separation,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 21–25.

[21] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in IEEE ICASSP, 2014, pp. 4087–4091.

[22] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, “Streaming small-footprint keyword spotting using sequence-to-sequence models,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 474–481.

[23] S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” in INTERSPEECH, 2017, pp. 1606–1610.

[24] R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” in IEEE ICASSP, 2018, pp. 5484–5488.

[25] G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni, “Model compression applied to small-footprint keyword spotting,” in INTERSPEECH, 2016, pp. 1878–1882.

[26] M. Sun, D. Snyder, Y. Gao, V. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, “Compressed time delay neural network for small-footprint keyword spotting,” in INTERSPEECH, 2017, pp. 3607–3611.

[27] L. Lu, M. Guo, and S. Renals, “Knowledge distillation for small-footprint highway networks,” in IEEE ICASSP, 2017, pp. 4820–4824.

[28] R. Pang, T. Sainath, R. Prabhavalkar, S. Gupta, Y. Wu, S. Zhang, and C.-C. Chiu, “Compression of end-to-end models,” in INTERSPEECH, 2018, pp. 27–31.

[29] B. Shi, M. Sun, C. Kao, V. Rozgic, S. Matsoukas, and C. Wang, “Compression of acoustic event detection models with quantized distillation,” to appear in INTERSPEECH, 2019.

[30] O. Abdel-Hamid, L. Deng, and D. Yu, “Exploring convolutional neural network structures and optimization techniques for speech recognition,” in INTERSPEECH, 2013, pp. 3366–3370.

[31] S. S. R. Phaye, E. Benetos, and Y. Wang, “Subspectralnet - using sub-spectrogram based convolutional neural networks for acoustic scene classification,” in IEEE ICASSP, 2019, pp. 825–829.

[32] M. Abadi, A. Agarwal et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

[33] “TensorFlow: Simple audio recognition.” [Online]. Available: https://www.tensorflow.org/tutorials/sequences/audio_recognition/
