Acoustic Scene Classification Using Fusion of Attentive Convolutional Neural Networks for DCASE2019 Challenge
Pith reviewed 2026-05-24 21:45 UTC · model grok-4.3
The pith
Fusion of three attentive CNN topologies improves acoustic scene classification on the DCASE 2019 test set.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The submitted acoustic scene classification systems fuse the outputs of three networks: a VGG-like 2D CNN, an LCNN with max-feature-map activation, and an x-vector 1D CNN. Each network uses self-attention pooling and is trained on 4-fold splits of log Mel-spectrogram features.
What carries the argument
Self-attention mechanism for statistic pooling applied inside each of the three CNN topologies (VGG-like, LCNN, x-vector) before fusion.
If this is right
- Different CNN topologies remain complementary even after each applies self-attention pooling.
- 4-fold training allows multiple models to be combined without additional labeled data.
- Fusion at the score level can raise accuracy on acoustic scenes not seen during training.
- Log Mel-spectrograms of 256 dimensions serve as sufficient input for all three topologies.
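Score-level fusion, as mentioned in the list above, can be sketched as a weighted average of per-class scores from each network. The helper names, the uniform default weights, and the toy scores are assumptions for illustration and are not values from the paper.

```python
def fuse_scores(model_scores, weights=None):
    """Weighted average of per-class scores across models.

    model_scores: one list of class scores per model, all of the same length.
    weights: optional per-model weights; uniform if omitted (an assumption).
    """
    n_models = len(model_scores)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(model_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, model_scores))
        for c in range(n_classes)
    ]


def predict(fused):
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Toy posteriors over three classes from three hypothetical networks.
vgg_like = [0.5, 0.4, 0.1]
lcnn = [0.2, 0.6, 0.2]
x_vector = [0.3, 0.5, 0.2]

fused = fuse_scores([vgg_like, lcnn, x_vector])
print(fused, predict(fused))
```

In this toy case the first network alone would pick class 0, while the averaged scores favor class 1, which is the kind of complementarity the fusion relies on.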
Where Pith is reading between the lines
- Attention pooling may reduce the need for hand-crafted segment-level features in audio classification.
- The same fusion approach could be tested on related tasks such as sound event detection.
- Results on the DCASE test set would indicate whether the complementarity holds for real-world recording variations.
Load-bearing premise
The three CNN topologies supply sufficiently complementary information when each uses self-attention pooling so that their fusion improves accuracy on the unseen DCASE test set.
What would settle it
An experiment that trains the three networks on the same 4-fold setup, fuses them, and shows no accuracy gain over the single best network on the official DCASE 2019 test set.
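A 4-fold setup of the kind this experiment needs can be sketched as follows. The round-robin assignment and the function name are assumptions for illustration; the page does not say how the authors generated their folds, and a real acoustic scene split would normally also keep recordings from the same location within one fold.

```python
def four_fold_splits(items, n_folds=4):
    """Return a list of (train, validation) pairs, one per fold.

    Items are assigned to folds round-robin, so every item appears in
    exactly one validation set and in the training set of the other folds.
    """
    items = list(items)
    folds = [items[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        validation = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        splits.append((train, validation))
    return splits


# Eight hypothetical segment identifiers.
for train, validation in four_fold_splits(range(8)):
    print("train:", train, "validation:", validation)
```

One network per fold can then be trained on `train` and scored on `validation`, and the per-fold networks combined at fusion time.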
Original abstract
In this report, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge are described. Also, the analysis of different methods is provided. The proposed approach is a fusion of three different Convolutional Neural Network (CNN) topologies. The first one is a VGG like two-dimensional CNNs. The second one is again a two-dimensional CNN network which uses Max-Feature-Map activation and called Light-CNN (LCNN). The third network is a one-dimensional CNN which mainly used for speaker verification and called x-vector topology. All proposed networks use self-attention mechanism for statistic pooling. As a feature, we use a 256-dimensional log Mel-spectrogram. Our submissions are a fusion of several networks trained on 4-folds generated evaluation setup using different fusion strategies.
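The Max-Feature-Map activation named in the abstract can be illustrated with a small sketch: the channels are split into two halves and the element-wise maximum of paired channels is kept, which halves the channel count. The function name and the flat-list representation of feature maps are assumptions for illustration, not the authors' implementation.

```python
def max_feature_map(channels):
    """Apply Max-Feature-Map to a list of 2*C feature maps and return C maps.

    channels: list of feature maps, each a flat list of equal length.
    Channel i is paired with channel i + C and the element-wise max is kept.
    """
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(channels) // 2
    return [
        [max(a, b) for a, b in zip(channels[i], channels[i + half])]
        for i in range(half)
    ]


# Four toy channels with two activations each become two channels.
print(max_feature_map([[1, -2], [0, 5], [3, -4], [-1, 6]]))
```

Unlike ReLU, this keeps negative activations when both paired values are negative, so the non-linearity comes from the competition between channels rather than from thresholding at zero.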
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the Brno University of Technology (BUT) team's submissions to DCASE 2019 Task 1 (Acoustic Scene Classification). The central approach is a fusion of three CNN topologies—a VGG-like 2D CNN, a Light-CNN (LCNN) using Max-Feature-Map activation, and an x-vector 1D CNN—each employing self-attention for statistic pooling on 256-dimensional log-Mel spectrograms. All networks are trained with a 4-fold cross-validation split of the development data, and the submissions apply different fusion strategies; the text also states that an analysis of the methods is provided.
Significance. As a challenge technical report the work supplies a concrete engineering description of system configurations that participated in DCASE 2019. The uniform application of self-attention pooling across architecturally diverse backbones is a coherent design choice that may promote complementarity. Its primary value lies in documenting reproducible training and fusion protocols for the shared task rather than in establishing new methodological claims with quantified gains.
Minor comments (2)
- [Abstract] The statement that 'the analysis of different methods is provided' is not accompanied by any reference to specific sections, tables, or quantitative comparisons (e.g., per-network accuracies or fusion gains), so the promised analysis cannot be located.
- The manuscript does not enumerate the exact fusion strategies (score averaging, learned weights, etc.) or the number of component networks per submission, which are central to reproducing the described systems.
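For orientation, the conclusions excerpt in the reference graph mentions a trained fusion and a majority vote fusion. A majority vote over per-network decisions can be sketched as below; the function name, the tie-breaking rule, and the example labels are assumptions for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most networks.

    predictions: one predicted label per network, in a fixed network order.
    Ties are broken in favor of the earliest network in the list, which is
    an arbitrary choice made for this sketch.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Three hypothetical networks voting on one audio segment.
print(majority_vote(["park", "tram", "park"]))
```

A majority vote needs only hard decisions, whereas a trained fusion needs the per-class scores of every component network, which is why the number of networks per submission matters for reproduction.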
Simulated Author's Rebuttal
We thank the referee for reviewing our DCASE 2019 Task 1 technical report and for recommending minor revision. The referee report summarizes our fusion approach and raises no major comments that require a point-by-point rebuttal.
Circularity Check
No significant circularity identified
The manuscript is a descriptive DCASE 2019 technical report detailing submitted systems (fusions of VGG-like, LCNN and x-vector networks with self-attention pooling on log-Mel features, trained via 4-fold splits of development data). No derivation chain, first-principles result, or mathematical prediction is claimed; the text is limited to configuration description and fusion strategies. Standard challenge practice of developing on dev folds does not constitute self-definitional or fitted-input circularity under the enumerated patterns, as no internal reduction of a claimed output to its own inputs occurs. The external challenge test set provides an independent benchmark.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: The proposed approach is a fusion of three different Convolutional Neural Network (CNN) topologies... All proposed networks use self-attention mechanism for statistic pooling. As a feature, we use a 256-dimensional log Mel-spectrogram.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: Our submissions are a fusion of several networks trained on 4-folds generated evaluation setup using different fusion strategies.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. This report describes Brno University of Technology (BUT) team submissions for the ASC challenge of DCASE 2019. We proposed three different deep neural network topologies for this task. The first one is a VGG like [1] two-dimensional CNN network for processing audio segments. The second network is again a 2-dimensional CNN network which c...
- [2] DATASET. In this challenge, an enhanced version of ASC dataset was used [11]. The dataset consists of recordings from 10 scene classes and was collected in 12 large European cities and in different environments in each city. The development set of the dataset for task1a consists of 1440 segments for each acoustic scene and in total 40 hours of audio...
- [3] DATA PROCESSING. 3.1. Features. The log Mel-scale spectrogram was used as a feature in this challenge. For extracting the features, first, we converted the audio to a mono-channel and removed the amplitude bias by subtracting the audio segment’s mean from the signal. Then short time Fourier transform is computed on 2048 samples Hamming windowed fram...
- [4] CNN TOPOLOGIES. We have used three different CNN topologies for this challenge. The first one is a VGG like two-dimensional CNN. The second topology is an enhanced version of Light-CNN (LCNN) which used Max-Feature-Map (MFM) as an additional non-linearity. MFM reduces the number of kernels to half. As a result, the final network has fewer parameters and ...
- [5] SYSTEMS AND FUSION. In this challenge, we fused outputs of different networks to obtain the final results. First, we made a 4-folds cross-validation setup using the whole development data in addition to the official provided setup (we only use the official validation set for report results here and for the final system we only used the generated folds). By...
- [6] EXPERIMENTS AND RESULTS. 6.1. Experimental Setups. Similar to the baseline system provided by the organizers, our networks training was performed by optimizing the categorical cross-entropy using Adam optimizer [15]. The initial learning rate was set to 0.001 and the network training was early-stopped if the validation loss did not decrease for more th...
- [7] CONCLUSIONS. We have described the systems submitted by BUT team to Acoustic Scene Classification (ASC) challenge of DCASE2019. Different systems were designed for this challenge and the final systems were fusions of the output scores from the individual system. A trained fusion as well as a majority vote fusion were used for the final system. The prop...
- [8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [9] X. Wu, R. He, Z. Sun, and T. Tan, “A light CNN for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
- [10] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Proc. Interspeech 2017, 2017, pp. 82–86. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-360
- [11] H. Zeinali, T. Stafylakis, G. Athanasopoulou, J. Rohdin, I. Gkinis, L. Burget, and J. Cernocky, “Detecting spoofing attacks using VGG and SincNet: BUT-Omilia submission to ASVspoof 2019 challenge,” in Proc. Interspeech 2019, 2019.
- [12] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
- [13] H. Zeinali, L. Burget, J. Rohdin, T. Stafylakis, and J. H. Cernocky, “How to improve your speaker embeddings extractor in generic toolkits,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6141–6145.
- [14] H. Zeinali, L. Burget, and J. Cernocky, “Convolutional neural networks and x-vector embedding for DCASE2018 acoustic scene classification challenge,” arXiv preprint arXiv:1810.04273, 2018.
- [15] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, “Self-attentive speaker embeddings for text-independent speaker verification,” Proc. Interspeech 2018, pp. 3573–3577, 2018.
- [16] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” arXiv preprint arXiv:1803.10963, 2018.
- [17] F. R. rahman Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, “Attention-based models for text-dependent speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5359–5363.
- [18] A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” in IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2018.
- [19] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” in Proceedings of the 14th python in science conference, 2015, pp. 18–25.
- [20] M. Dorfer, B. Lehner, H. Eghbal-zadeh, H. Christop, P. Fabian, and W. Gerhard, “Acoustic scene classification with fully convolutional neural networks and I-vectors,” DCASE2018 Challenge, Tech. Rep., September 2018.
- [21] N. Brümmer, “FoCal multi-class: Toolkit for evaluation, fusion and calibration of multi-class recognition scores: tutorial and user manual,” Software available at http://sites.google.com/site/nikobrummer/focalmulticlass, 2007.
- [22] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.