Batch Uniformization for Minimizing Maximum Anomaly Score of DNN-based Anomaly Detection in Sounds
Pith reviewed 2026-05-24 19:11 UTC · model grok-4.3
The pith
Batch uniformization weights each sample's anomaly score by the reciprocal of its mini-batch density estimate to equalize scores across frequent and rare normal sounds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing the unweighted mean anomaly score with a density-weighted mean, where the weight for each sample is the reciprocal of its kernel density estimate computed on the current mini-batch, allows the training process to reduce anomaly scores for both frequent-normal and rare-normal sounds at the same time.
What carries the argument
Batch uniformization: the replacement of the sample-mean anomaly score loss with a weighted mean whose weights are the reciprocal of kernel density estimates obtained independently on each training mini-batch.
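The weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the fixed bandwidth of 0.5, the division by the weight sum, and the names `in_batch_kde` and `batch_uniformized_loss` are all choices made for this sketch, and the toy batch is made up.

```python
import math


def in_batch_kde(batch, bandwidth):
    """Gaussian kernel density estimate at every sample, using only this batch.

    Each sample's own kernel is included in its estimate.
    """
    n, dim = len(batch), len(batch[0])
    norm = n * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim
    densities = []
    for xi in batch:
        kernel_sum = 0.0
        for xj in batch:
            sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
            kernel_sum += math.exp(-0.5 * sq_dist / bandwidth ** 2)
        densities.append(kernel_sum / norm)
    return densities


def batch_uniformized_loss(scores, batch, bandwidth=0.5):
    """Weighted mean of anomaly scores with weights 1 / estimated density.

    Dividing by the weight sum is an assumption of this sketch.
    """
    weights = [1.0 / p for p in in_batch_kde(batch, bandwidth)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Toy 1-D batch (made up): four frequent samples near 0 and one rare sample at 3.
batch = [[0.0], [0.1], [-0.1], [0.05], [3.0]]
scores = [0.1, 0.1, 0.1, 0.1, 1.0]  # the rare sample has the highest score

plain_mean = sum(scores) / len(scores)
weighted = batch_uniformized_loss(scores, batch)
# The weighted loss is larger than the plain mean because the rare sample counts for more.
print(plain_mean, weighted)
```

In a training loop the weights would normally be treated as constants for the gradient step; whether the paper does so is not stated on this page.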
If this is right
- Anomaly scores become more uniform across normal sounds, regardless of how often each kind of sound occurs in the training data.
- No external labels or additional normal-sound data are required beyond what is already used for standard training.
- The same weighting can be applied to any reconstruction-based or score-based DNN anomaly detector.
- Detection performance improves on the sound anomaly detection tasks examined in the verification experiments.
Where Pith is reading between the lines
- The same density-weighting idea could be tested in image or sensor anomaly detection where normal examples also vary in frequency.
- Performance may depend on mini-batch size because kernel density estimation is performed inside each batch.
- If the density estimates are inaccurate, the weighting could amplify rather than reduce score variation among normal sounds.
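One way the in-batch estimates can become uninformative is easy to reproduce. The sketch below is an illustration only: the standard-normal toy data, the batch size of 32, the fixed bandwidth of 0.5, and the function names are assumptions, not values from the paper. It measures the ratio between the largest and smallest reciprocal-density weight in a batch. In 64 dimensions every sample lies far from every other sample relative to the bandwidth, so each estimate is dominated by the sample's own kernel and the weights collapse toward the plain mean.

```python
import math
import random


def weight_spread(batch, bandwidth):
    """Ratio of the largest to the smallest reciprocal-density weight in a batch."""
    kernel_sums = []
    for xi in batch:
        total = 0.0
        for xj in batch:
            sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
            total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
        kernel_sums.append(total)  # proportional to the KDE value at xi
    # weight = 1 / density, so max weight / min weight = max density / min density
    return max(kernel_sums) / min(kernel_sums)


def random_batch(n, dim, rng):
    """Draw n standard-normal points in `dim` dimensions (toy data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
spread_1d = weight_spread(random_batch(32, 1, rng), bandwidth=0.5)
spread_64d = weight_spread(random_batch(32, 64, rng), bandwidth=0.5)
print(spread_1d, spread_64d)
```

A spread close to 1 means the weighted loss is effectively the unweighted mean, so any uniformization effect would have to come from somewhere else.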
Load-bearing premise
The probability density of each sample can be accurately estimated by kernel density estimation performed independently on each training mini-batch, and weighting anomaly scores by the reciprocal of this density will produce constant anomaly scores for both frequent- and rare-normal sounds.
What would settle it
Running the standard training and the batch-uniformization training on the same dataset of normal sounds that contains both frequent and rare examples, then observing no improvement or a drop in detection AUC or similar metrics, would falsify the central claim.
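The comparison proposed above rests on a detection metric such as AUC. As a rough sketch of how that number is obtained from anomaly scores, the function below computes AUC-ROC in its Mann-Whitney form. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def auc_roc(normal_scores, anomalous_scores):
    """Probability that a random anomalous sample scores above a random normal one.

    Ties count as one half (Mann-Whitney form of the AUC).
    """
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalous_scores) * len(normal_scores))


# Hypothetical scores: the third normal sample is a rare-normal sound that
# the model reconstructs poorly, so it outranks both anomalies.
anomalous = [0.6, 0.8]
before = auc_roc([0.1, 0.2, 0.9], anomalous)
# Same anomalies after the rare-normal score has been pushed down.
after = auc_roc([0.1, 0.2, 0.3], anomalous)
print(before, after)
```

The toy numbers only show why a high score on a rare-normal sound lowers AUC; they say nothing about whether the proposed training achieves the reduction.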
Original abstract
Use of an autoencoder (AE) as a normal model is a state-of-the-art technique for unsupervised-anomaly detection in sounds (ADS). The AE is trained to minimize the sample mean of the anomaly score of normal sounds in a mini-batch. One problem with this approach is that the anomaly score of rare-normal sounds becomes higher than that of frequent-normal sounds, because the sample mean is strongly affected by frequent-normal samples, resulting in preferentially decreasing the anomaly score of frequent-normal samples. To decrease anomaly scores for both frequent- and rare-normal sounds, we propose batch uniformization, a training method for unsupervised-ADS for minimizing a weighted average of the anomaly score on each sample in a mini-batch. We used the reciprocal of the probabilistic density of each sample as the weight, more intuitively, a large weight is given for rare-normal sounds. Such a weight works to give a constant anomaly score for both frequent- and rare-normal sounds. Since the probabilistic density is unknown, we estimate it by using the kernel density estimation on each training mini-batch. Verification- and objective-experiments show that the proposed batch uniformization improves the performance of unsupervised-ADS.
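The statement that reciprocal-density weights give "a constant anomaly score" can be read as importance weighting toward a uniform distribution. The following is a sketch of that reading, not the paper's derivation. The symbols are assumptions of this note: $A_\theta$ is the anomaly score, $p$ the unknown density of normal sounds (assumed positive on its support $\Omega$), $\hat{p}_B$ the kernel estimate computed on mini-batch $B$, and the self-normalized form in the last line is one possible estimator.

```latex
\begin{align}
J^{\mathrm{mean}}_\theta &= \frac{1}{N}\sum_{n=1}^{N} A_\theta(x_n)
  \;\approx\; \mathbb{E}_{x \sim p}\!\left[A_\theta(x)\right]
  = \int_\Omega p(x)\, A_\theta(x)\, dx, \\
\mathbb{E}_{x \sim p}\!\left[\frac{A_\theta(x)}{p(x)}\right]
  &= \int_\Omega p(x)\, \frac{A_\theta(x)}{p(x)}\, dx
  = \int_\Omega A_\theta(x)\, dx, \\
J^{\mathrm{BU}}_\theta &\approx
  \frac{\sum_{n=1}^{N} w_n\, A_\theta(x_n)}{\sum_{n=1}^{N} w_n},
  \qquad w_n = \frac{1}{\hat{p}_B(x_n)}.
\end{align}
```

The first line shows why frequent sounds dominate the standard loss: each region of $\Omega$ contributes in proportion to $p(x)$. In the second line the density cancels, so every region of the support contributes equally, which is the same as averaging the score under a uniform distribution on $\Omega$ up to a constant factor.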
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes batch uniformization for training autoencoders in unsupervised anomaly detection in sounds (ADS). The standard per-mini-batch mean anomaly score loss is replaced by a weighted average whose weights are the reciprocal of per-sample kernel density estimates obtained via KDE performed independently inside each mini-batch; the goal is to drive anomaly scores toward constancy for both frequent- and rare-normal sounds. Verification and objective experiments are stated to show performance gains over the baseline approach.
Significance. If the weighting mechanism functions as intended, the method would address a plausible bias in mean-based AE training that favors frequent normal samples. The per-batch KDE formulation is a lightweight, parameter-light modification that could be useful in imbalanced sound datasets; however, its practical significance hinges on whether the density estimates remain informative in the high-dimensional regimes typical of audio features.
major comments (2)
- [Method description (abstract and §3)] The central mechanism assumes that per-mini-batch KDE supplies reliable relative densities for the reciprocal weighting to equalize anomaly scores. In high-dimensional audio feature spaces (spectrograms or filter banks) with typical small batch sizes, KDE is subject to the curse of dimensionality, rendering estimates noisy or bandwidth-dominated; this directly threatens whether the claimed uniformization occurs or whether any gains are incidental. This assumption is load-bearing for the paper's contribution.
- [Experiments (abstract and §4/§5)] The abstract states that experiments demonstrate improvement but supplies no information on baselines, metrics (e.g., AUC, precision-recall), dataset splits, number of runs, or statistical significance testing. Without these details it is impossible to assess whether reported gains support the uniformization claim or arise from post-hoc choices.
minor comments (1)
- [Abstract] The phrase 'probabilistic density' should be replaced by the standard term 'probability density' for terminological precision.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly where appropriate.
-
Referee: [Method description (abstract and §3)] The central mechanism assumes that per-mini-batch KDE supplies reliable relative densities for the reciprocal weighting to equalize anomaly scores. In high-dimensional audio feature spaces (spectrograms or filter banks) with typical small batch sizes, KDE is subject to the curse of dimensionality, rendering estimates noisy or bandwidth-dominated; this directly threatens whether the claimed uniformization occurs or whether any gains are incidental. This assumption is load-bearing for the paper's contribution.
Authors: We acknowledge that KDE can be affected by the curse of dimensionality in high-dimensional feature spaces. Our method relies on relative densities within each mini-batch (not absolute densities), so the reciprocal weights depend on the ratios between sample densities rather than on their common scale; this reduces sensitivity to absolute accuracy. Bandwidth was chosen via a rule-of-thumb on log-mel features. To strengthen the paper we will add a paragraph in §3 discussing bandwidth selection, a small sensitivity study, and the limitation that performance may degrade if batch size is too small relative to dimensionality. revision: yes
-
Referee: [Experiments (abstract and §4/§5)] The abstract states that experiments demonstrate improvement but supplies no information on baselines, metrics (e.g., AUC, precision-recall), dataset splits, number of runs, or statistical significance testing. Without these details it is impossible to assess whether reported gains support the uniformization claim or arise from post-hoc choices.
Authors: We agree the abstract is insufficiently informative. Sections 4 and 5 already specify the DCASE 2016/2017 datasets, log-mel features, AUC-ROC as the primary metric, comparison against the standard mean-loss AE baseline, and results averaged over 5 random seeds. We will revise the abstract to explicitly state the metric (AUC), the datasets used, and that gains are consistent across multiple runs. If statistical significance testing is not already reported in §5 we will add it (paired t-test or similar) as part of the revision. revision: yes
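The paired test mentioned in the rebuttal could look roughly like the sketch below. It computes a paired t statistic over per-seed AUC values and compares it with the two-sided 5% critical value for 4 degrees of freedom (2.776). The AUC values are placeholders invented for this illustration and are not results from the paper; the function name is also hypothetical.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic of the per-pair differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Placeholder per-seed AUCs (5 seeds). NOT results from the paper.
auc_bu = [0.81, 0.83, 0.80, 0.84, 0.82]
auc_baseline = [0.78, 0.79, 0.79, 0.80, 0.78]

t = paired_t_statistic(auc_bu, auc_baseline)
T_CRIT = 2.776  # two-sided 5% critical value of Student's t with 4 degrees of freedom
print(t, abs(t) > T_CRIT)
```

Pairing by seed is appropriate only if both methods share the same seed-dependent conditions, such as data split and initialization, which the page does not confirm.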
Circularity Check
No circularity: method is a proposed heuristic validated by experiment
The paper introduces batch uniformization as a training procedure that weights anomaly scores by the reciprocal of per-mini-batch KDE density estimates. The abstract and description frame this as an algorithmic change whose benefit is demonstrated through verification and objective experiments on sound anomaly detection. No equations, derivations, or self-citations are presented that reduce the claimed performance gain to a quantity defined by the method itself or to a fitted parameter renamed as a prediction. The weighting scheme is explicitly constructed from the proposed procedure, but the improvement is treated as an empirical outcome rather than a mathematical identity. This matches the default case of a self-contained algorithmic contribution without load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- KDE bandwidth
axioms (1)
- domain assumption: Kernel density estimation on individual mini-batches provides a sufficiently accurate estimate of the underlying sample density for weighting purposes.
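The page does not say which rule sets the bandwidth. As an illustration of how such a free parameter is commonly fixed, the sketch below applies Scott's rule, which scales each dimension's standard deviation by n^(-1/(d+4)). Treating Scott's rule as the relevant rule of thumb is an assumption of this note, and the function name and toy batch are hypothetical.

```python
import statistics


def scott_bandwidths(batch):
    """Per-dimension bandwidth from Scott's rule: sigma_k * n ** (-1 / (d + 4))."""
    n, dim = len(batch), len(batch[0])
    factor = n ** (-1.0 / (dim + 4))
    return [factor * statistics.stdev([x[k] for x in batch]) for k in range(dim)]


# Toy batch of four 2-D samples (made up).
batch = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]]
bandwidths = scott_bandwidths(batch)
print(bandwidths)
```

For a fixed batch size the factor n^(-1/(d+4)) approaches 1 as the dimension d grows, so in high-dimensional feature spaces the rule yields a bandwidth close to the per-dimension standard deviation.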
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Since anomalies might indicate mistakes or malicious activities, prompt detection of anomalies may prevent such problems. The use of microphones as sensors for anomaly detection, called anomaly detection in sounds (ADS) or acoustic condition monitoring, has been adopted in many applications such as audio surveillance [1–4], product inspection...
-
[2]
Batch Uniformization for Minimizing Maximum Anomaly Score of DNN-based Anomaly Detection in Sounds
CONVENTIONAL METHOD 2.1. Unsupervised anomaly detection in sounds ADS is an identification problem of determining whether the state of the target is a normal or an anomaly from the sound emitted from the target. Here, we define $X = \{x_t \in \mathbb{R}^D\}_{t=1}^{T}$ is a time-series of acoustic features extracted from the observed sound. Here, T is the number of time-frames co...
-
[3]
PROPOSED METHOD Figure 1 shows an overview of the proposed method, batch uniformization. The difference between the conventional and proposed methods is that the proposed method uses the weight calculation. In this section, we describe the basic principle and implementation of the weight calculation. 3.1. Basic principle To avoid FP detection, we need t...
-
[4]
EXPERIMENTS We conducted a verification experiment and an objective experiment. Batch uniformization (BU) was compared with two conventional methods: reconstruction error (RE) and simplified Neyman–Pearson cost (SNP) [11], which are described in Sec. 2.2. Here, BU, RE, and SNP were trained using $J^{\mathrm{BU}}_\theta$, $J^{\mathrm{RE}}_\theta$, and $J^{\mathrm{SNP}}_\theta$, respectively. 4.1. Verificat...
-
[5]
Figures 2 and 3 show $q_\theta(x)$ and $A(x)$, respectively. These scores were calculated on each grid point in $-3 \le x_1, x_2 \le 3$. Table 1 shows the KLD from p(x) and U(x). The PDF of RE has high probability for both normal and anomaly, thus the anomalous samples were also reconstructed and their anomaly scores became small. Meanwhile, in the case of SNP, the probab...
-
[6]
CONCLUSIONS In this paper, we proposed batch uniformization, a training method for unsupervised-anomaly detection in sounds (ADS). The weighted average of the anomaly score was minimized, and the weight was defined as the reciprocal of the probabilistic density of each sample. We estimated it by using the kernel density estimation on each mini-batch. Ve...
-
[7]
Events Detection for an Audio-Based Surveillance System,
C. Clavel, T. Ehrette, and G. Richard, “Events Detection for an Audio-Based Surveillance System,” in Proc. of IEEE International Conference on Multimedia and Expo (ICME), 2005
-
[8]
Scream and Gunshot Detection and Localization for Audio-Surveillance Systems,
G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, “Scream and Gunshot Detection and Localization for Audio-Surveillance Systems,” in Proc. of IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), 2007
-
[9]
Probabilistic Novelty Detection for Acoustic Surveillance Under Real-World Conditions,
S. Ntalampiras, I. Potamitis, and N. Fakotakis, “Probabilistic Novelty Detection for Acoustic Surveillance Under Real-World Conditions,” IEEE Transactions on Multimedia, pp.713–719, 2011
-
[10]
Audio Surveillance of Roads: A System for Detecting Anomalous Sounds,
P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento, “Audio Surveillance of Roads: A System for Detecting Anomalous Sounds,” IEEE Transactions on Intelligent Transportation Systems, pp.279–288, 2016
-
[11]
Inspection of Visible and Invisible Features of Objects with Image and Sound Signal Processing,
A. Yamashita, T. Hara, and T. Kaneko, “Inspection of Visible and Invisible Features of Objects with Image and Sound Signal Processing,” in Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006
-
[12]
Assessing the Performance of a Semi-Automated Acoustic Monitoring System for Primates,
S. Heinicke, A. K. Kalan, O. J. J. Wagner, R. Mundry, H. Lukashevich, and H. S. Kühl, “Assessing the Performance of a Semi-Automated Acoustic Monitoring System for Primates,” Methods in Ecology and Evolution, pp.753–763, 2015
-
[13]
Y. Koizumi, S. Saito, H. Uematsu, and N. Harada, “Optimizing Acoustic Feature Extractor for Anomalous Sound Detection Based on Neyman-Pearson Lemma,” in Proc. of European Signal Processing Conference (EUSIPCO), 2017
-
[14]
Y. Koizumi, S. Murata, N. Harada, S. Saito, and H. Uematsu, “SNIPER: Few-shot Learning for Anomaly Detection to Minimize False-Negative Rate with Ensured True-Positive Rate,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019
-
[15]
DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline system,
A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline system,” in Proc. of Detection and Classification of Acoustic Scenes and Events challenge (DCASE), 2017
-
[16]
Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge,
A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley, “Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge,” IEEE/ACM Transactions on Audio Speech and Language Processing, 2018
-
[17]
Unsupervised Detection of Anomalous Sound based on Deep Learning and the Neyman-Pearson Lemma,
Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, “Unsupervised Detection of Anomalous Sound based on Deep Learning and the Neyman-Pearson Lemma,” IEEE/ACM Transactions on Audio Speech and Language Processing, 2019
-
[18]
A Survey of Outlier Detection Methodologies,
V. J. Hodge and J. Austin, “A Survey of Outlier Detection Methodologies,” Artificial Intelligence Review, pp.85–126, 2004
-
[19]
An Overview of Anomaly Detection Techniques: Existing Solutions and Latest Technological Trends,
A. Patcha and J. M. Park, “An Overview of Anomaly Detection Techniques: Existing Solutions and Latest Technological Trends,” Journal Computer Networks, pp.3448–3470, 2007
-
[20]
V. Chandola, A. Banerjee, and V. Kumar, “Anomaly Detection: A Survey,” ACM Computing Surveys, 2009
-
[21]
Deep Learning for Anomaly Detection: A Survey
R. Chalapathy and S. Chawla, “Deep Learning for Anomaly Detection: A Survey,” arXiv preprint, arXiv:1901.03407, 2019
-
[22]
E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, “A Novel Approach for Automatic Acoustic Novelty Detection using a Denoising Autoencoder with Bidirectional LSTM Neural Networks,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015
-
[23]
Structured Denoising Autoencoder for Fault Detection and Analysis,
T. Tagawa, Y. Tadokoro, and T. Yairi, “Structured Denoising Autoencoder for Fault Detection and Analysis,” Proceedings of Machine Learning Research, pp.96–111, 2015
-
[24]
Non-Linear Prediction with LSTM Recurrent Neural Networks for Acoustic Novelty Detection,
E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, “Non-Linear Prediction with LSTM Recurrent Neural Networks for Acoustic Novelty Detection,” in Proc. of International Joint Conference on Neural Networks (IJCNN), 2015
-
[25]
How Can We Detect Anomalies from Subsampled Audio Signals?,
Y. Kawaguchi and T. Endo, “How Can We Detect Anomalies from Subsampled Audio Signals?,” in Proc. of IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2017
-
[26]
Variational Autoencoder based Anomaly Detection using Reconstruction Probability,
J. An and S. Cho, “Variational Autoencoder based Anomaly Detection using Reconstruction Probability,” Technical Report, SNU Data Mining Center, pp.1–18, 2015
-
[27]
Complementary Set Variational Autoencoder for Supervised Anomaly Detection,
Y. Kawachi, Y. Koizumi, and N. Harada, “Complementary Set Variational Autoencoder for Supervised Anomaly Detection,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018
-
[28]
A Two-Class Hyper-Spherical Autoencoder for Supervised Anomaly Detection,
Y. Kawachi, Y. Koizumi, S. Murata, and N. Harada, “A Two-Class Hyper-Spherical Autoencoder for Supervised Anomaly Detection,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019
-
[29]
M. Yamaguchi, Y. Koizumi, and N. Harada, “AdaFlow: Domain-Adaptive Density Estimator with Application to Anomaly Detection and Unpaired Cross-Domain Transition,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019
-
[30]
Determinantal Point Processes for Mini-batch Diversification,
C. Zhang, H. Kjellstrom, and S. Mandt, “Determinantal Point Processes for Mini-batch Diversification,” in Proc. of Uncertainty in Artificial Intelligence (UAI), 2017
-
[31]
Active Mini-Batch Sampling using Repulsive Point Processes,
C. Zhang, C. Oztireli, S. Mandt, and G. Salvi, “Active Mini-Batch Sampling using Repulsive Point Processes,” in Proc. of The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), 2019
-
[32]
Understanding the Difficulty of Training Deep Feedforward Neural Networks,
X. Glorot and Y. Bengio, “Understanding the Difficulty of Training Deep Feedforward Neural Networks,” in Proc. of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010
-
[33]
On the Convergence of Adam and Beyond,
S. J. Reddi, S. Kale, and S. Kumar, “On the Convergence of Adam and Beyond,” in Proc. of International Conference on Learning Representations (ICLR), 2018
-
[34]
E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, X. Favory, J. Pons, and X. Serra, “General-Purpose Tagging of Freesound Audio with Audioset Labels: Task Description, Dataset, and Baseline,” in Proc. of Detection and Classification of Acoustic Scenes and Events (DCASE), 2018