arxiv: 1907.00873 · v1 · pith:OP5GPKP7new · submitted 2019-07-01 · 📡 eess.AS · cs.LG · cs.SD

Compression of Acoustic Event Detection Models With Quantized Distillation

Pith reviewed 2026-05-25 11:19 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD
keywords acoustic event detection · model compression · knowledge distillation · quantization · deep neural networks · compact models · edge deployment

The pith

Combining distillation and quantization compresses a large acoustic event detection model into a student that is 2% of the teacher's size, while distillation cuts the compact student's error rate by 15%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that a joint approach of knowledge distillation followed by quantization can turn large teacher networks for acoustic event detection into compact student models that are both smaller and more accurate. This matters because current high-performing AED models demand too much computation to run on everyday devices. Distillation first improves the student's error rate over an ordinary compact network, then quantization delivers the bulk of the size cut. If the results hold, accurate event detection becomes feasible on hardware with tight memory and power limits.

Core claim

The paper claims that jointly leveraging knowledge distillation and quantization compresses a larger teacher model into a compact student model for acoustic event detection. This lowers the error rate of the original compact network by 15% through distillation and reduces model size to 2% of the teacher and 12% of the full-precision student through quantization.
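
The two size ratios in this claim also fix how the full-precision student compares with the teacher. A minimal Python sketch of that arithmetic follows; it uses only the percentages quoted in the abstract, and the constant and function names are illustrative rather than taken from the paper.

```python
# Size ratios quoted in the abstract, expressed as fractions.
QUANTIZED_OVER_TEACHER = 0.02  # quantized student is 2% of the teacher
QUANTIZED_OVER_STUDENT = 0.12  # quantized student is 12% of the full-precision student


def implied_student_over_teacher(q_over_t: float, q_over_s: float) -> float:
    """Full-precision student size as a fraction of the teacher size.

    Follows from (quantized / teacher) = (quantized / student) * (student / teacher).
    """
    return q_over_t / q_over_s


ratio = implied_student_over_teacher(QUANTIZED_OVER_TEACHER, QUANTIZED_OVER_STUDENT)
print(f"full-precision student / teacher = {ratio:.3f}")  # prints 0.167
```

If the quoted figures are exact, the unquantized student is already about one-sixth the size of the teacher, and quantization supplies the remaining factor of roughly eight.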

What carries the argument

Joint knowledge distillation from a teacher AED model to a student model followed by quantization of the student.
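
The exact training objective is not reproduced on this page, so the following is only a sketch of one common way to distill a multi-label detector, not the authors' formulation. It assumes per-class sigmoid outputs (the Methods excerpt in the reference graph describes multi-hot labels) and blends a hard-label binary cross-entropy with a soft-target term computed from temperature-softened teacher outputs. The names `alpha` and `temperature` and their default values are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(z: float, temperature: float = 1.0) -> float:
    """Logistic function with an optional softening temperature."""
    return 1.0 / (1.0 + math.exp(-z / temperature))


def bce(target: float, prob: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one class; target may be soft (in [0, 1])."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    labels: Sequence[int],
    alpha: float = 0.5,        # assumed weight on the hard-label term
    temperature: float = 2.0,  # assumed softening temperature
) -> float:
    """Mean over classes of alpha * hard BCE + (1 - alpha) * soft BCE."""
    total = 0.0
    for s, t, y in zip(student_logits, teacher_logits, labels):
        hard = bce(float(y), sigmoid(s))
        soft = bce(sigmoid(t, temperature), sigmoid(s, temperature))
        total += alpha * hard + (1.0 - alpha) * soft
    return total / len(labels)


# Three event classes; only the first event is present in this clip.
loss = distillation_loss([2.0, -1.0, -3.0], [4.0, -4.0, -5.0], [1, 0, 0])
print(f"distillation loss: {loss:.4f}")
```

The soft term lets the student learn from the teacher's graded confidence in each event rather than from the binary labels alone, which is the usual motivation for distillation.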

If this is right

  • The resulting student models fit on devices with limited memory and compute.
  • Detection accuracy improves relative to an uncompressed compact baseline.
  • Memory footprint drops to roughly one-fiftieth of the original teacher model.
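
To make the footprint figures concrete, here is a hedged sketch of uniform weight quantization in Python. The bit-width and quantization scheme are not stated in the text available here, so the 4-bit setting and the 32-bit full-precision baseline below are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence


def quantize_uniform(weights: Sequence[float], bits: int) -> List[float]:
    """Snap each weight to the nearest of 2**bits evenly spaced levels in [min, max]."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [lo for _ in weights]
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def size_fraction(bits: int, full_precision_bits: int = 32) -> float:
    """Storage per weight relative to full precision, ignoring scale/offset overhead."""
    return bits / full_precision_bits


weights = [-1.0, -0.2, 0.3, 1.0]
print(quantize_uniform(weights, bits=2))  # at most 4 distinct values
print(size_fraction(4))                   # 0.125 under the assumed 4-bit / 32-bit setting
```

Under these assumed settings the quantized weights take one-eighth of the original storage, which is in the neighborhood of the reported 12% but does not establish the bit-width the paper actually used.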

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage compression could be tried on related audio tasks such as sound classification or keyword spotting.
  • Quantizing after distillation may preserve more accuracy than quantizing first and then distilling.
  • Measuring inference latency on actual embedded hardware would show whether the size cut translates into usable speed gains.

Load-bearing premise

The 15% error reduction and extreme size savings seen with the chosen teacher-student pair, datasets, and quantization scheme will appear in other settings.

What would settle it

Applying the same distillation-then-quantization pipeline to a different AED dataset or architecture and finding either no drop in error rate from distillation or no size reduction relative to the full-precision student.

Original abstract

Acoustic Event Detection (AED), aiming at detecting categories of events based on audio signals, has found application in many intelligent systems. Recently deep neural network significantly advances this field and reduces detection errors to a large scale. However how to efficiently execute deep models in AED has received much less attention. Meanwhile state-of-the-art AED models are based on large deep models, which are computational demanding and challenging to deploy on devices with constrained computational resources. In this paper, we present a simple yet effective compression approach which jointly leverages knowledge distillation and quantization to compress larger network (teacher model) into compact network (student model). Experimental results show proposed technique not only lowers error rate of original compact network by 15% through distillation but also further reduces its model size to a large extent (2% of teacher, 12% of full-precision student) through quantization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a joint knowledge-distillation-plus-quantization procedure to compress a large teacher acoustic-event-detection network into a compact student. The central empirical claim, stated in the abstract, is that distillation alone reduces the student's error rate by 15 % relative to an undistilled compact baseline, while subsequent quantization shrinks the student to 2 % of the teacher's size and 12 % of the full-precision student's size.

Significance. If the numerical claims are reproducible and generalize, the result would be useful for deploying AED models on resource-limited hardware; the combination of distillation and quantization is a standard compression recipe whose joint application to this task has not been widely reported. The manuscript supplies no machine-checked proofs, open code, or parameter-free derivations, so its contribution rests entirely on the strength of the (currently unreported) experiments.

major comments (3)
  1. [Abstract] Abstract: the 15 % error-rate reduction is stated without absolute baseline or teacher error rates, without dataset identity or size, without bit-width or quantization scheme, and without error bars or number of runs. These omissions make the central numerical claim impossible to evaluate or reproduce.
  2. [Abstract] Abstract: the phrase 'quantized distillation' is introduced without any description of the training procedure (post-hoc quantization, quantization-aware training, joint loss, etc.). Because the size-reduction numbers (2 % / 12 %) depend on this choice, the interaction between the two techniques cannot be assessed.
  3. [Abstract] The manuscript contains no equations, algorithm box, or pseudocode that would allow a reader to implement the claimed joint procedure; the entire contribution is therefore carried by the experimental section, which is not described in the supplied text.
minor comments (2)
  1. [Abstract] Abstract, sentence 3: 'deep neural network significantly advances' should read 'deep neural networks have significantly advanced'.
  2. [Abstract] Abstract, sentence 4: 'computational demanding' should read 'computationally demanding'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each major comment below and will revise the manuscript to improve the abstract's clarity and reproducibility.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the 15 % error-rate reduction is stated without absolute baseline or teacher error rates, without dataset identity or size, without bit-width or quantization scheme, and without error bars or number of runs. These omissions make the central numerical claim impossible to evaluate or reproduce.

    Authors: We agree that the abstract would benefit from additional context. In the revised version we will report the absolute error rates for the teacher and the undistilled compact baseline, identify the dataset and its size, specify the quantization bit-width and scheme, and indicate the number of runs or variability. revision: yes

  2. Referee: [Abstract] Abstract: the phrase 'quantized distillation' is introduced without any description of the training procedure (post-hoc quantization, quantization-aware training, joint loss, etc.). Because the size-reduction numbers (2 % / 12 %) depend on this choice, the interaction between the two techniques cannot be assessed.

    Authors: We agree a brief description is warranted. The revised abstract will state that the method performs joint training with a combined knowledge-distillation and quantization loss. revision: yes

  3. Referee: [Abstract] The manuscript contains no equations, algorithm box, or pseudocode that would allow a reader to implement the claimed joint procedure; the entire contribution is therefore carried by the experimental section, which is not described in the supplied text.

    Authors: The full manuscript contains an experimental section with implementation details. To address the concern we will add a short algorithm box or pseudocode outlining the joint procedure in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivation chain

Full rationale

The paper describes an empirical method combining knowledge distillation and quantization for model compression in acoustic event detection. The abstract and available text contain no equations, derivations, fitted parameters presented as predictions, or self-citations that bear load on a central claim. Results are reported as experimental outcomes (error rate reduction, size ratios) without any reduction to self-defined quantities or ansatzes. This matches the default expectation of a non-circular empirical paper; no steps qualify under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities are described; the contribution is an empirical compression recipe.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

  1. [1]

    In surveillance systems, audio is used either independently or in conjunction with visual modality for scene analysis

    Introduction Acoustic event detection (AED), the task of detecting the occurrence of certain events based on audio streams, can be widely applied in many scenarios. In surveillance systems, audio is used either independently or in conjunction with visual modality for scene analysis. For example, [1] applies AED model to detect hazardous events and p...

  2. [2]

    Related work Neural network compression has been well explored in broad context. Knowledge distillation [7] is a commonly used technique for model compression, which consists of training a compact student network with distilled knowledge from a large teacher network. Knowledge distillation has been widely applied in various domains, including aut...

  3. [3]

    [21] combines quantization and low-rank matrix factorization technique to compress multi-layer recurrent neural network

    investigates compression of CNNs and their method is on simplification of architectures by introducing bottleneck layers and global pooling. [21] combines quantization and low-rank matrix factorization technique to compress multi-layer recurrent neural network. In [22] knowledge distillation is applied to train CNNs of small footprint. This paper focu...

  4. [4]

    Given an audio signal I (e.g

    Methods We start by formulating the multi-class acoustic event detection problem. Given an audio signal $I$ (e.g. log mel-filter bank energies (LFBEs)), the task is to train a model $f$ to predict a multi-hot vector $y \in \{0, 1\}^C$, with $C$ being the size of event set $E$, and $y_c$ being a binary indicator whether event $c$ is present in $I$. Note the prediction f (...

  5. [5]

    Compared to CNNs, RNN has following advantages: (1)

    is ResNet [24] with 50 layers. Compared to CNNs, RNN has following advantages: (1). It is more compact and induces much less computation compared to a deep CNN (see table 2 in experimental section for detailed comparison) (2). For CNN, entire sequence of raw input and its sub-sampled feature sequence of each intermediate layer have to be stored in me...

  6. [6]

    Experimental Setting Data The dataset we use is a subset from Audioset [25], which contains a large amount of 10-second audio clips

    Experiments 4.1. Experimental Setting Data The dataset we use is a subset from Audioset [25], which contains a large amount of 10-second audio clips. In particular, we select dog sound, baby crying and gunshots as the target events. These three events included in Audioset amount to 13,460, 2,313 and 4,083 respectively, and we use all of them. In add...

  7. [7]

    Our compression scheme jointly applies knowledge distillation and quantization to the target model

    Conclusion We study the model compression problem in the context of acoustic event detection. Our compression scheme jointly applies knowledge distillation and quantization to the target model. Experimental results show that the performance of shallow LSTM model can be greatly improved via knowledge distillation without increase of size. The distill...

  8. [8]

    Automatic detection and classification of audio events for road surveillance applications,

    N. Almaadeed, M. Asim, S. Al-maadeed, A. Bouridane, and A. Beghdadi, “Automatic detection and classification of audio events for road surveillance applications,” vol. 18, p. 1858, 06 2018

  9. [9]

    A Closer Look at Weak Label Learning for Audio Events

    A. Shah, A. Kumar, A. G. Hauptmann, and B. Raj, “A closer look at weak label learning for audio events,” CoRR, vol. abs/1804.09288, 2018

  10. [10]

    CNN architectures for large-scale audio classification,

    S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. Weiss, and K. Wilson, “CNN architectures for large-scale audio classification,” in ICASSP, 2017

  11. [11]

    Convolutional gated recurrent neural network incorporating spatial features for audio tagging,

    Y. Xu, Q. Kong, Q. Huang, W. Wang, and M. D. Plumbley, “Convolutional gated recurrent neural network incorporating spatial features for audio tagging,” in IJCNN, 2017

  12. [12]

    Deep convolutional neural networks and data augmentation for acoustic event detection,

    N. Takahashi, M. Gygli, B. Pfister, and L. Gool, “Deep convolutional neural networks and data augmentation for acoustic event detection,” CoRR, 2016

  13. [13]

    Convolutional recurrent neural networks for rare sound event detection,

    E. Cakir and T. Virtanen, “Convolutional recurrent neural networks for rare sound event detection,” in DCASE2017, pp. 27–31

  14. [14]

    Distilling the knowledge in a neural network,

    G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in CoRR, 2015

  15. [15]

    Distilling knowledge from ensembles of neural networks for speech recognition,

    A. Waters and Y. Chebotar, “Distilling knowledge from ensembles of neural networks for speech recognition,” in Interspeech, 2016

  16. [16]

    Knowledge distillation for small-footprint highway networks,

    L. Lu, M. Guo, and S. Renals, “Knowledge distillation for small-footprint highway networks,” in ICASSP, 2017

  17. [17]

    Compression of end-to-end models,

    R. Pang, T. N. Sainath, R. Prabhavalkar, S. Gupta, Y. Wu, S. Zhang, and C. Chiu, “Compression of end-to-end models,” in Interspeech, 2018

  18. [18]

    Learning efficient object detection models with knowledge distillation,

    G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” in NIPS, 2017

  19. [19]

    Quantized neural networks: Training neural networks with low precision weights and activations,

    I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” Journal of Machine Learning Research, vol. 18, 2018

  20. [20]

    Effective Quantization Methods for Recurrent Neural Networks

    Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou, “Effective quantization methods for recurrent neural networks,” arXiv:1611.10176, 2016

  21. [21]

    On the efficient representation and execution of deep acoustic models,

    R. Alvarez, R. Prabhavalkar, and A. Bakhtin, “On the efficient representation and execution of deep acoustic models,” in Interspeech, 2016

  22. [22]

    Model compression via distillation and quantization,

    A. Polino, R. Pascanu, and D. Alistarh, “Model compression via distillation and quantization,” in ICLR, 2018

  23. [23]

    Low-rank matrix factorization for deep neural network training with high-dimensional output targets,

    T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in ICASSP, 2013

  24. [24]

    On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition,

    R. Prabhavalkar, O. Alsharif, A. Bruguier, and I. McGraw, “On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition,” in ICASSP, 2016

  25. [25]

    Model compression applied to small-footprint keyword spotting,

    G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni, “Model compression applied to small-footprint keyword spotting,” Interspeech, 2016

  26. [26]

    Compressed time delay neural network for small-footprint keyword spotting,

    M. Sun, D. Snyder, Y. Gao, V. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, “Compressed time delay neural network for small-footprint keyword spotting,” Interspeech, 2017

  27. [27]

    Reducing model complexity for DNN based large-scale audio classification,

    Y. Wu and T. Lee, “Reducing model complexity for DNN based large-scale audio classification,” in ICASSP, 2018

  28. [28]

    Compression of acoustic event detection models with low-rank matrix factorization and quantization training,

    B. Shi, M. Sun, C.-C. Kao, V. Rozgic, S. Matsoukas, and C. Wang, “Compression of acoustic event detection models with low-rank matrix factorization and quantization training,” NeurIPS workshop on Compact Deep Neural Networks with industrial applications, 2018

  29. [29]

    Teacher-student training for acoustic event detection using audioset,

    R. Shi, R. W. M. Ng, and P. Swietojanski, “Teacher-student training for acoustic event detection using audioset,” ICASSP, 2019

  30. [30]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Y. Bengio, N. Leonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” CoRR, abs/1308.3432, 2013

  31. [31]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016

  32. [32]

    Audio set: An ontology and human-labeled dataset for audio events,

    J. F. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017

  33. [33]

    Densely connected convolutional networks,

    G. Huang, Z. Liu, L. Maaten, and K. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017