Compression of Acoustic Event Detection Models With Quantized Distillation

Bowen Shi; Chao Wang; Chieh-Chi Kao; Ming Sun; Spyros Matsoukas; Viktor Rozgic

arxiv: 1907.00873 · v1 · pith:OP5GPKP7new · submitted 2019-07-01 · 📡 eess.AS · cs.LG · cs.SD

Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang

Pith reviewed 2026-05-25 11:19 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD

keywords acoustic event detection · model compression · knowledge distillation · quantization · deep neural networks · compact models · edge deployment

The pith

Combining distillation and quantization compresses large acoustic event detection models to 2% of teacher size while lowering the compact baseline's error rate by 15%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that knowledge distillation followed by quantization can turn large teacher networks for acoustic event detection into student models that are far smaller than the teacher and more accurate than a conventionally trained compact network. This matters because current high-performing AED models demand too much computation to run on everyday devices. Distillation first lowers the student's error rate relative to an ordinary compact network; quantization then delivers the bulk of the size cut. If the results hold, accurate event detection becomes feasible on hardware with tight memory and power limits.

Core claim

The paper claims that jointly leveraging knowledge distillation and quantization compresses a larger teacher model into a compact student model for acoustic event detection. This lowers the error rate of the original compact network by 15% through distillation and reduces model size to 2% of the teacher and 12% of the full-precision student through quantization.

What carries the argument

Joint knowledge distillation from a teacher AED model to a student model followed by quantization of the student.
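The distillation half of this recipe can be sketched as a loss that mixes the teacher's soft predictions with the ground-truth labels. The sketch below is a generic multi-label formulation, not the paper's confirmed loss: the sigmoid outputs, the mixing weight `alpha`, and the `temperature` value are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard (0/1) or soft (in [0, 1])."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(probs)
                          + (1.0 - targets) * np.log(1.0 - probs)))

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of a soft-target term and a hard-label term.

    alpha and temperature are hypothetical hyperparameters, not values
    taken from the paper.
    """
    # Soft term: the student matches the teacher's temperature-smoothed outputs.
    soft_targets = sigmoid(teacher_logits / temperature)
    soft_preds = sigmoid(student_logits / temperature)
    soft_term = binary_cross_entropy(soft_targets, soft_preds)
    # Hard term: the student matches the multi-hot ground-truth labels.
    hard_term = binary_cross_entropy(labels, sigmoid(student_logits))
    return alpha * soft_term + (1.0 - alpha) * hard_term

# Toy batch: one clip, three event classes.
teacher = np.array([[4.0, -4.0, 4.0]])
labels = np.array([[1.0, 0.0, 1.0]])
agreeing_student = np.array([[4.0, -4.0, 4.0]])
disagreeing_student = np.array([[-4.0, 4.0, -4.0]])
loss_agree = distillation_loss(agreeing_student, teacher, labels)
loss_disagree = distillation_loss(disagreeing_student, teacher, labels)
```

A student that agrees with both the teacher and the labels gets a lower loss than one that contradicts them, which is the only behavior this sketch is meant to show.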

If this is right

The resulting student models fit on devices with limited memory and compute.
Detection accuracy improves relative to an uncompressed compact baseline.
Memory footprint drops to roughly one-fiftieth of the original teacher model.
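The size figures can be cross-checked with a few lines of arithmetic. The two ratios come from the abstract; the 32-bit full-precision baseline and the idea that the ratio reflects only the per-weight bit-width are assumptions made here, so the implied bit-width is a rough reading rather than a reported number.

```python
# Ratios stated in the abstract.
quantized_vs_teacher = 0.02   # quantized student / teacher
quantized_vs_student = 0.12   # quantized student / full-precision student

# Implied size of the full-precision student relative to the teacher.
student_vs_teacher = quantized_vs_teacher / quantized_vs_student  # about 0.167

# Teacher size expressed as a multiple of the quantized student.
teacher_to_quantized = 1.0 / quantized_vs_teacher  # about 50

# ASSUMPTION: the full-precision student stores 32-bit floats and the 12%
# figure is driven by weight bit-width alone (no codebook or scale overhead).
assumed_full_precision_bits = 32
implied_bits = assumed_full_precision_bits * quantized_vs_student  # about 3.84

print(f"student / teacher ~ {student_vs_teacher:.3f}")
print(f"teacher / quantized student ~ {teacher_to_quantized:.0f}x")
print(f"implied bits per weight ~ {implied_bits:.2f}")
```

Under these assumptions the full-precision student is already about one-sixth of the teacher, so the student architecture accounts for part of the overall reduction and quantization supplies the rest.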

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage compression could be tried on related audio tasks such as sound classification or keyword spotting.
Quantizing after distillation may preserve more accuracy than quantizing first and then distilling.
Measuring inference latency on actual embedded hardware would show whether the size cut translates into usable speed gains.

Load-bearing premise

The 15% error reduction and extreme size savings seen with the chosen teacher-student pair, datasets, and quantization scheme will appear in other settings.

What would settle it

Applying the same distillation-then-quantization pipeline to a different AED dataset or architecture and measuring no drop in error rate or no reduction below the full-precision student size.

read the original abstract

Acoustic Event Detection (AED), aiming at detecting categories of events based on audio signals, has found application in many intelligent systems. Recently deep neural network significantly advances this field and reduces detection errors to a large scale. However how to efficiently execute deep models in AED has received much less attention. Meanwhile state-of-the-art AED models are based on large deep models, which are computational demanding and challenging to deploy on devices with constrained computational resources. In this paper, we present a simple yet effective compression approach which jointly leverages knowledge distillation and quantization to compress larger network (teacher model) into compact network (student model). Experimental results show proposed technique not only lowers error rate of original compact network by 15% through distillation but also further reduces its model size to a large extent (2% of teacher, 12% of full-precision student) through quantization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper applies distillation then quantization to shrink AED models and claims a 15% error drop plus big size cuts, but the abstract leaves out datasets, baselines, bit widths, and variance so the numbers are hard to judge.

read the letter

The core result is a joint compression pipeline for acoustic event detection: distill from a large teacher into a compact student to cut error rate by 15%, then quantize the student down to 2% of teacher size and 12% of the full-precision student. That combination is the only thing presented as new; the individual techniques are standard. The practical angle is clear: large AED models are too heavy for edge devices, so showing a simple recipe that improves accuracy while slashing size is the useful part. If the experiments are solid, practitioners working on on-device audio could borrow the approach without much trouble.

The soft spots sit exactly where the stress-test note flags them. The abstract gives no dataset names, no absolute error rates for teacher or baselines, no bit-width or quantization scheme that produces the 12% figure, and no error bars or run counts. The procedure itself is underspecified: it is not clear whether quantization happens after distillation, inside the distillation loss, or via some other schedule. Without those controls the 15% gain and extreme size numbers cannot be audited or reproduced from the given text.

If the full paper supplies proper ablations, statistical significance, and failure cases, the work becomes a straightforward engineering note worth reading for the target community. If those details are missing or weak, the claims rest on unreported experimental choices.

This is for people who deploy audio models on constrained hardware rather than for theorists. A reader who needs a compression recipe for AED might extract a usable idea, but anyone wanting to cite or extend the numbers would need the full experimental section first. I would send it to peer review so referees can verify the controls and decide whether the reported gains are real.

Referee Report

3 major / 2 minor

Summary. The paper proposes a joint knowledge-distillation-plus-quantization procedure to compress a large teacher acoustic-event-detection network into a compact student. The central empirical claim, stated in the abstract, is that distillation alone reduces the student's error rate by 15 % relative to an undistilled compact baseline, while subsequent quantization shrinks the student to 2 % of the teacher's size and 12 % of the full-precision student's size.

Significance. If the numerical claims are reproducible and generalize, the result would be useful for deploying AED models on resource-limited hardware; the combination of distillation and quantization is a standard compression recipe whose joint application to this task has not been widely reported. The manuscript supplies no machine-checked proofs, open code, or parameter-free derivations, so its contribution rests entirely on the strength of the (currently unreported) experiments.

major comments (3)

[Abstract] Abstract: the 15 % error-rate reduction is stated without absolute baseline or teacher error rates, without dataset identity or size, without bit-width or quantization scheme, and without error bars or number of runs. These omissions make the central numerical claim impossible to evaluate or reproduce.
[Abstract] Abstract: the phrase 'quantized distillation' is introduced without any description of the training procedure (post-hoc quantization, quantization-aware training, joint loss, etc.). Because the size-reduction numbers (2 % / 12 %) depend on this choice, the interaction between the two techniques cannot be assessed.
[Abstract] The manuscript contains no equations, algorithm box, or pseudocode that would allow a reader to implement the claimed joint procedure; the entire contribution is therefore carried by the experimental section, which is not described in the supplied text.
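To make the options named in the second comment concrete, here is a minimal sketch of uniform weight quantization applied after training. It is a generic illustration and not the paper's scheme: the function name `uniform_quantize`, the min-max range, and the 4-bit default are assumptions. In quantization-aware training, the dequantized weights would typically be used in the forward pass while gradients flow to the full-precision copy through a straight-through estimator.

```python
import numpy as np

def uniform_quantize(weights, num_bits=4):
    """Map weights to integer codes on a uniform min-max grid.

    Returns the integer codes and the dequantized approximation of the
    original weights. Hypothetical helper for illustration only.
    """
    levels = 2 ** num_bits - 1
    w_min = float(weights.min())
    w_max = float(weights.max())
    # Guard against a constant tensor, where the range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = np.round((weights - w_min) / scale).astype(np.int32)
    dequantized = codes * scale + w_min
    return codes, dequantized

# Example: quantize a small weight vector to 4 bits (16 levels).
example_weights = np.linspace(-1.0, 1.0, 101)
example_codes, example_dequantized = uniform_quantize(example_weights, num_bits=4)
max_abs_error = float(np.max(np.abs(example_dequantized - example_weights)))
```

The rounding error per weight is bounded by half the grid step, which is why fewer bits trade model size against approximation error.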

minor comments (2)

[Abstract] Abstract, sentence 3: 'deep neural network significantly advances' should read 'deep neural networks have significantly advanced'.
[Abstract] Abstract, sentence 4: 'computational demanding' should read 'computationally demanding'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each major comment below and will revise the manuscript to improve the abstract's clarity and reproducibility.

read point-by-point responses

Referee: [Abstract] Abstract: the 15 % error-rate reduction is stated without absolute baseline or teacher error rates, without dataset identity or size, without bit-width or quantization scheme, and without error bars or number of runs. These omissions make the central numerical claim impossible to evaluate or reproduce.

Authors: We agree that the abstract would benefit from additional context. In the revised version we will report the absolute error rates for the teacher and the undistilled compact baseline, identify the dataset and its size, specify the quantization bit-width and scheme, and indicate the number of runs or variability.
Revision: yes

Referee: [Abstract] Abstract: the phrase 'quantized distillation' is introduced without any description of the training procedure (post-hoc quantization, quantization-aware training, joint loss, etc.). Because the size-reduction numbers (2 % / 12 %) depend on this choice, the interaction between the two techniques cannot be assessed.

Authors: We agree a brief description is warranted. The revised abstract will state that the method performs joint training with a combined knowledge-distillation and quantization loss.
Revision: yes

Referee: [Abstract] The manuscript contains no equations, algorithm box, or pseudocode that would allow a reader to implement the claimed joint procedure; the entire contribution is therefore carried by the experimental section, which is not described in the supplied text.

Authors: The full manuscript contains an experimental section with implementation details. To address the concern we will add a short algorithm box or pseudocode outlining the joint procedure in the revised manuscript.
Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivation chain

full rationale

The paper describes an empirical method combining knowledge distillation and quantization for model compression in acoustic event detection. The abstract and available text contain no equations, derivations, fitted parameters presented as predictions, or self-citations that bear load on a central claim. Results are reported as experimental outcomes (error rate reduction, size ratios) without any reduction to self-defined quantities or ansatzes. This matches the default expectation of a non-circular empirical paper; no steps qualify under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities are described; the contribution is an empirical compression recipe.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1]

In surveillance systems, audio is used either independently or in conjunction with visual modality for scene analysis

Introduction. Acoustic event detection (AED), the task of detecting the occurrence of certain events based on audio streams, can be widely applied in many scenarios. In surveillance systems, audio is used either independently or in conjunction with visual modality for scene analysis. For example, [1] applies AED model to detect hazardous events and p...

[2]

Related work. Neural network compression has been well explored in broad context. Knowledge distillation [7] is a commonly used technique for model compression, which consists of training a compact student network with distilled knowledge from a large teacher network. Knowledge distillation has been widely applied in various domains, including aut...

[3]

[21] combines quantization and low-rank matrix factorization technique to compress multi-layer recurrent neural network

investigates compression of CNNs and their method is on simplification of architectures by introducing bottleneck layers and global pooling. [21] combines quantization and low-rank matrix factorization technique to compress multi-layer recurrent neural network. In [22] knowledge distillation is applied to train CNNs of small footprint. This paper focu...

[4]

Given an audio signal I (e.g

Methods. We start by formulating the multi-class acoustic event detection problem. Given an audio signal I (e.g. log mel-filter bank energies (LFBEs)), the task is to train a model f to predict a multi-hot vector $y \in \{0, 1\}^C$, with C being the size of event set E, and $y_c$ being a binary indicator whether event c is present in I. Note the prediction f (...

[5]

Compared to CNNs, RNN has following advantages: (1)

is ResNet [24] with 50 layers. Compared to CNNs, RNN has following advantages: (1) It is more compact and induces much less computation compared to a deep CNN (see table 2 in experimental section for detailed comparison). (2) For CNN, entire sequence of raw input and its sub-sampled feature sequence of each intermediate layer have to be stored in me...

[6]

Experimental Setting Data The dataset we use is a subset from Audioset [25], which contains a large amount of 10-second audio clips

Experiments. 4.1. Experimental Setting. Data. The dataset we use is a subset from Audioset [25], which contains a large amount of 10-second audio clips. In particular, we select dog sound, baby crying and gunshots as the target events. These three events included in Audioset amount to 13,460, 2,313 and 4,083 respectively, and we use all of them. In add...

[7]

Our compression scheme jointly applies knowledge distillation and quantization to the target model

Conclusion. We study the model compression problem in the context of acoustic event detection. Our compression scheme jointly applies knowledge distillation and quantization to the target model. Experimental results show that the performance of shallow LSTM model can be greatly improved via knowledge distillation without increase of size. The distill...

[8]

Automatic detection and classification of audio events for road surveillance applications

N. Almaadeed, M. Asim, S. Al-maadeed, A. Bouridane, and A. Beghdadi, “Automatic detection and classification of audio events for road surveillance applications,” vol. 18, p. 1858, 06 2018

[9]

A Closer Look at Weak Label Learning for Audio Events

A. Shah, A. Kumar, A. G. Hauptmann, and B. Raj, “A closer look at weak label learning for audio events,” CoRR, vol. abs/1804.09288, 2018

[10]

CNN architectures for large-scale audio classification

S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. Weiss, and K. Wilson, “CNN architectures for large-scale audio classification,” in ICASSP, 2017

[11]

Convolutional gated recurrent neural network incorporating spatial features for audio tagging

Y. Xu, Q. Kong, Q. Huang, W. Wang, and M. D. Plumbley, “Convolutional gated recurrent neural network incorporating spatial features for audio tagging,” in IJCNN, 2017

[12]

Deep convolutional neural networks and data augmentation for acoustic event detection

N. Takahashi, M. Gygli, B. Pfister, and L. Gool, “Deep convolutional neural networks and data augmentation for acoustic event detection,” CoRR, 2016

[13]

Convolutional recurrent neural networks for rare sound event detection

E. Cakir and T. Virtanen, “Convolutional recurrent neural networks for rare sound event detection,” in DCASE2017, pp. 27–31

[14]

Distilling the knowledge in a neural network

G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in CoRR, 2015

[15]

Distilling knowledge from ensembles of neural networks for speech recognition

A. Waters and Y. Chebotar, “Distilling knowledge from ensembles of neural networks for speech recognition,” in Interspeech, 2016

[16]

Knowledge distillation for small-footprint highway networks

L. Lu, M. Guo, and S. Renals, “Knowledge distillation for small-footprint highway networks,” in ICASSP, 2017

[17]

Compression of end-to-end models

R. Pang, T. N. Sainath, R. Prabhavalkar, S. Gupta, Y. Wu, S. Zhang, and C. Chiu, “Compression of end-to-end models,” in Interspeech, 2018

[18]

Learning efficient object detection models with knowledge distillation

G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” in NIPS, 2017

[19]

Quantized neural networks: Training neural networks with low precision weights and activations

I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” Journal of Machine Learning Research, vol. 18, 2018

[20]

Effective Quantization Methods for Recurrent Neural Networks

Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou, “Effective quantization methods for recurrent neural networks,” arXiv:1611.10176, 2016

[21]

On the efficient representation and execution of deep acoustic models

R. Alvarez, R. Prabhavalkar, and A. Bakhtin, “On the efficient representation and execution of deep acoustic models,” in Interspeech, 2016

[22]

Model compression via distillation and quantization

A. Polino, R. Pascanu, and D. Alistarh, “Model compression via distillation and quantization,” in ICLR, 2018

[23]

Low-rank matrix factorization for deep neural network training with high-dimensional output targets

T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in ICASSP, 2013

[24]

On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition

R. Prabhavalkar, O. Alsharif, A. Bruguier, and I. McGraw, “On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition,” in ICASSP, 2016

[25]

Model compression applied to small-footprint keyword spotting

G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni, “Model compression applied to small-footprint keyword spotting,” Interspeech, 2016

[26]

Compressed time delay neural network for small-footprint keyword spotting

M. Sun, D. Snyder, Y. Gao, V. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, “Compressed time delay neural network for small-footprint keyword spotting,” Interspeech, 2017

[27]

Reducing model complexity for DNN based large-scale audio classification

Y. Wu and T. Lee, “Reducing model complexity for DNN based large-scale audio classification,” in ICASSP, 2018

[28]

Compression of acoustic event detection models with low-rank matrix factorization and quantization training

B. Shi, M. Sun, C.-C. Kao, V. Rozgic, S. Matsoukas, and C. Wang, “Compression of acoustic event detection models with low-rank matrix factorization and quantization training,” NeurIPS workshop on Compact Deep Neural Networks with industrial applications, 2018

[29]

Teacher-student training for acoustic event detection using audioset

R. Shi, R. W. M. Ng, and P. Swietojanski, “Teacher-student training for acoustic event detection using audioset,” ICASSP, 2019

[30]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Y. Bengio, N. Leonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” CoRR, abs/1308.3432, 2013

[31]

Deep residual learning for image recognition

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016

[32]

Audio set: An ontology and human-labeled dataset for audio events

J. F. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017

[33]

Densely connected convolutional networks

G. Huang, Z. Liu, L. Maaten, and K. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017
