
arxiv: 1907.07127 · v1 · pith:CHCXDEVInew · submitted 2019-07-13 · 📡 eess.AS · cs.SD

Acoustic Scene Classification Using Fusion of Attentive Convolutional Neural Networks for DCASE2019 Challenge

Pith reviewed 2026-05-24 21:45 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords acoustic scene classification · convolutional neural networks · self-attention pooling · DCASE 2019 · network fusion · log Mel-spectrogram · VGG · x-vector

The pith

Fusion of three attentive CNN topologies improves acoustic scene classification on the DCASE 2019 test set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system for acoustic scene classification that combines three convolutional neural network architectures. A VGG-like 2D CNN, a Light-CNN using max-feature-map activation, and a 1D x-vector network each process 256-dimensional log Mel-spectrograms and apply self-attention for statistic pooling. Networks are trained separately on a 4-fold evaluation setup, then combined using multiple fusion strategies. The resulting submissions are entered in Task 1 of the DCASE 2019 challenge. The central idea is that the different topologies capture complementary information when attention pooling is used in each.
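To make the input representation concrete, here is a minimal NumPy sketch of a 256-band log Mel-spectrogram. The 2048-sample Hamming window, mono down-mix, and mean removal follow the paper's data-processing excerpt; the 48 kHz sample rate, the hop of 1024 samples, the HTK-style mel formula, and the log floor are assumptions made for illustration, not values confirmed by the text. At this resolution some of the narrow low-frequency filters can be empty, which is harmless for the sketch.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; the paper does not state the variant).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    return np.clip(np.minimum(rising, falling), 0.0, None)

def log_mel_spectrogram(audio, sample_rate=48000, n_fft=2048, hop=1024, n_mels=256):
    """audio: (samples,) or (channels, samples). Returns (n_mels, n_frames)."""
    if audio.ndim == 2:
        audio = audio.mean(axis=0)            # down-mix to mono
    audio = audio - audio.mean()              # remove the amplitude bias
    n_frames = 1 + (len(audio) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hamming(n_fft)   # Hamming-windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-10).T              # small floor avoids log(0)

rng = np.random.default_rng(0)
stereo = rng.normal(size=(2, 48000))          # one second of synthetic stereo noise
features = log_mel_spectrogram(stereo)
print(features.shape)  # (256, 45)
```

Each column of `features` is one frame of the 256-dimensional input that all three networks would consume under these assumptions.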

Core claim

The authors' submitted systems for acoustic scene classification are fusions of the outputs of a VGG-like 2D CNN, an LCNN with max-feature-map activation, and an x-vector 1D CNN. Each network uses self-attention pooling and is trained on log Mel-spectrogram features under a 4-fold setup.
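The max-feature-map activation named above can be illustrated in a few lines. This is a sketch of the usual Light-CNN formulation (split the channels into two halves and keep the element-wise maximum), which matches the paper's remark that MFM halves the number of kernels; the exact variant used in the submitted LCNN is an assumption.

```python
import numpy as np

def max_feature_map(x: np.ndarray) -> np.ndarray:
    """Max-Feature-Map over the channel axis.

    x: feature maps of shape (channels, height, width) with an even channel count.
    Returns an array of shape (channels // 2, height, width).
    """
    channels = x.shape[0]
    if channels % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    first, second = x[: channels // 2], x[channels // 2 :]
    return np.maximum(first, second)  # competitive selection between paired channels

feature_maps = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)
out = max_feature_map(feature_maps)
print(out.shape)  # (2, 2, 3)
```

Because the output has half as many channels as the input, the following layer needs fewer weights, which is where the "light" in Light-CNN comes from.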

What carries the argument

Self-attention mechanism for statistic pooling applied inside each of the three CNN topologies (VGG-like, LCNN, x-vector) before fusion.
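A minimal sketch of attention-weighted statistic pooling follows. It uses the common single-head form (a small tanh layer scores each frame, a softmax turns scores into weights, and a weighted mean and standard deviation are concatenated). The parameter names `W`, `b`, `v` and this particular scoring function are assumptions; the paper's exact attention layer is not given in the text reviewed here.

```python
import numpy as np

def attentive_stats_pooling(frames, W, b, v):
    """Pool a variable-length sequence into one fixed-size vector.

    frames: (T, D) frame-level features; W: (D, A); b: (A,); v: (A,).
    Returns (2 * D,): attention-weighted mean and standard deviation.
    """
    scores = np.tanh(frames @ W + b) @ v       # one scalar score per frame, shape (T,)
    scores = scores - scores.max()             # numerical stability for the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    mean = weights @ frames                    # weighted mean, shape (D,)
    var = weights @ (frames ** 2) - mean ** 2  # weighted second moment minus mean squared
    std = np.sqrt(np.clip(var, 1e-12, None))
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
T, D, A = 50, 8, 4
frames = rng.normal(size=(T, D))
pooled = attentive_stats_pooling(
    frames, rng.normal(size=(D, A)), np.zeros(A), rng.normal(size=A)
)
print(pooled.shape)  # (16,)
```

With all attention parameters set to zero the weights become uniform and the function reduces to plain mean and standard-deviation pooling, which is a quick sanity check on the implementation.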

If this is right

  • Different CNN topologies remain complementary even after each applies self-attention pooling.
  • 4-fold training allows multiple models to be combined without additional labeled data.
  • Fusion at the score level can raise accuracy on recordings not seen during training.
  • Log Mel-spectrograms of 256 dimensions serve as sufficient input for all three topologies.
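Score-level fusion can be sketched as follows. The paper's conclusions excerpt mentions a trained fusion and a majority vote; the code below shows a majority vote and, as an untrained stand-in for the trained fusion, a plain average of log-probabilities. The array layout and the toy scores are illustrative assumptions.

```python
import numpy as np

def average_fusion(scores):
    """scores: (n_systems, n_clips, n_classes) log-probabilities. Returns class index per clip."""
    return scores.mean(axis=0).argmax(axis=1)

def majority_vote(scores):
    """Each system votes for its top class; ties go to the lowest class index."""
    votes = scores.argmax(axis=2)                        # (n_systems, n_clips)
    n_classes = scores.shape[2]
    one_hot = votes[..., None] == np.arange(n_classes)   # (n_systems, n_clips, n_classes)
    return one_hot.sum(axis=0).argmax(axis=1)

# Toy posteriors for 3 systems, 2 clips, 3 classes (synthetic, for illustration only).
probs = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]],
    [[0.2, 0.5, 0.3], [0.1, 0.2, 0.7]],
])
log_probs = np.log(probs)
print(average_fusion(log_probs))  # [0 2]
print(majority_vote(log_probs))   # [0 2]
```

Averaging keeps each system's confidence, while voting uses only the top choice, so the two can disagree when one system is confidently right and two are narrowly wrong.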

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attention pooling may reduce the need for hand-crafted segment-level features in audio classification.
  • The same fusion approach could be tested on related tasks such as sound event detection.
  • Results on the DCASE test set would indicate whether the complementarity holds for real-world recording variations.

Load-bearing premise

The three CNN topologies supply sufficiently complementary information when each uses self-attention pooling so that their fusion improves accuracy on the unseen DCASE test set.

What would settle it

An experiment that trains the three networks on the same 4-fold setup, fuses them, and shows no accuracy gain over the single best network on the official DCASE 2019 test set.
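The check described above amounts to comparing the fused accuracy with the best single-system accuracy on the same labels. A minimal sketch, assuming scores are stored as a `(n_systems, n_clips, n_classes)` array and using mean-score fusion as a stand-in for the paper's fusion strategies; the toy scores are synthetic and say nothing about the actual DCASE results.

```python
import numpy as np

def accuracy(pred, labels):
    return float((pred == labels).mean())

def fusion_gain(scores, labels):
    """Return (best single accuracy, fused accuracy, gain) for mean-score fusion."""
    single = [accuracy(s.argmax(axis=1), labels) for s in scores]
    fused = accuracy(scores.mean(axis=0).argmax(axis=1), labels)
    best = max(single)
    return best, fused, fused - best

# Probability of class 0 for 3 systems on 4 clips; each system errs on a different clip.
p0 = np.array([
    [0.9, 0.1, 0.9, 0.6],
    [0.9, 0.6, 0.9, 0.1],
    [0.4, 0.1, 0.9, 0.1],
])
scores = np.stack([p0, 1.0 - p0], axis=2)   # (3, 4, 2)
labels = np.array([0, 1, 0, 1])
best, fused, gain = fusion_gain(scores, labels)
print(best, fused, gain)  # 0.75 1.0 0.25
```

A gain at or below zero on the official test set would be the outcome that undercuts the complementarity premise.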

Original abstract

In this report, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge are described. Also, the analysis of different methods is provided. The proposed approach is a fusion of three different Convolutional Neural Network (CNN) topologies. The first one is a VGG like two-dimensional CNNs. The second one is again a two-dimensional CNN network which uses Max-Feature-Map activation and called Light-CNN (LCNN). The third network is a one-dimensional CNN which mainly used for speaker verification and called x-vector topology. All proposed networks use self-attention mechanism for statistic pooling. As a feature, we use a 256-dimensional log Mel-spectrogram. Our submissions are a fusion of several networks trained on 4-folds generated evaluation setup using different fusion strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript describes the Brno University of Technology (BUT) team's submissions to DCASE 2019 Task 1 (Acoustic Scene Classification). The central approach is a fusion of three CNN topologies—a VGG-like 2D CNN, a Light-CNN (LCNN) using Max-Feature-Map activation, and an x-vector 1D CNN—each employing self-attention for statistic pooling on 256-dimensional log-Mel spectrograms. All networks are trained with a 4-fold cross-validation split of the development data, and the submissions apply different fusion strategies; the text also states that an analysis of the methods is provided.

Significance. As a challenge technical report the work supplies a concrete engineering description of system configurations that participated in DCASE 2019. The uniform application of self-attention pooling across architecturally diverse backbones is a coherent design choice that may promote complementarity. Its primary value lies in documenting reproducible training and fusion protocols for the shared task rather than in establishing new methodological claims with quantified gains.

minor comments (2)
  1. [Abstract] The statement that 'the analysis of different methods is provided' is not accompanied by any reference to specific sections, tables, or quantitative comparisons (e.g., per-network accuracies or fusion gains), so the promised analysis cannot be located.
  2. The manuscript does not enumerate the exact fusion strategies (score averaging, learned weights, etc.) or the number of component networks per submission, which are central to reproducing the described systems.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the review of our DCASE 2019 Task 1 technical report and for the recommendation of minor revision. The report provides a concise summary of our fusion approach but lists no specific major comments requiring point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity identified


The manuscript is a descriptive DCASE 2019 technical report detailing submitted systems (fusions of VGG-like, LCNN and x-vector networks with self-attention pooling on log-Mel features, trained via 4-fold splits of development data). No derivation chain, first-principles result, or mathematical prediction is claimed; the text is limited to configuration description and fusion strategies. Standard challenge practice of developing on dev folds does not constitute self-definitional or fitted-input circularity under the enumerated patterns, as no internal reduction of a claimed output to its own inputs occurs. The external challenge test set provides an independent benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented entities are described in the abstract; the work is an empirical system description relying on standard neural-network training assumptions.

pith-pipeline@v0.9.0 · 5701 in / 1114 out tokens · 24296 ms · 2026-05-24T21:45:00.343737+00:00 · methodology



Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 4 internal anchors

  1. [1]

    We proposed three different deep neural network topologies for this task

    INTRODUCTION This report describes Brno University of Technology (BUT) team submissions for the ASC challenge of DCASE 2019. We proposed three different deep neural network topologies for this task. The first one is a VGG like [1] two-dimensional CNN network for processing audio segments. The second network is again a 2-dimensional CNN network which c...

  2. [2]

    The dataset consists of recordings from 10 scene classes and was collected in 12 large European cities and in different environments in each city

    DATASET In this challenge, an enhanced version of ASC dataset was used [11]. The dataset consists of recordings from 10 scene classes and was collected in 12 large European cities and in different environments in each city. The development set of the dataset for task1a consists of 1440 segments for each acoustic scene and in total 40 hours of audio...

  3. [3]

    Features The log Mel-scale spectrogram was used as a feature in this challenge

    DATA PROCESSING 3.1. Features The log Mel-scale spectrogram was used as a feature in this challenge. For extracting the features, first, we converted the audio to a mono-channel and removed the amplitude bias by subtracting the audio segment’s mean from the signal. Then short time Fourier transform is computed on 2048 samples Hamming windowed fram...

  4. [4]

    The first one is a VGG like two-dimensional CNN

    CNN TOPOLOGIES We have used three different CNN topologies for this challenge. The first one is a VGG like two-dimensional CNN. The second topology is an enhanced version of Light-CNN (LCNN) which used Max-Feature-Map (MFM) as an additional non-linearity. MFM reduces the number of kernels to half. As a result, the final network has fewer parameters and ...

  5. [5]

    SYSTEMS AND FUSION In this challenge, we fused outputs of different networks to obtain the final results. First, we made a 4-folds cross-validation setup using the whole development data in addition to the official provided setup (we only use the official validation set for report results here and for the final system we only used the generated folds). By...

  6. [6]

    EXPERIMENTS AND RESULTS 6.1. Experimental Setups Similar to the baseline system provided by the organizers, our networks training was performed by optimizing the categorical cross-entropy using Adam optimizer [15]. The initial learning rate was set to 0.001 and the network training was early-stopped if the validation loss did not decrease for more th...

  7. [7]

    Different systems were designed for this challenge and the final systems were fusions of the output scores from the individual system

    CONCLUSIONS We have described the systems submitted by BUT team to Acoustic Scene Classification (ASC) challenge of DCASE2019. Different systems were designed for this challenge and the final systems were fusions of the output scores from the individual system. A trained fusion as well as a majority vote fusion were used for the final system. The prop...

  8. [8]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014

  9. [9]

    A light CNN for deep face representation with noisy labels,

    X. Wu, R. He, Z. Sun, and T. Tan, “A light CNN for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018

  10. [10]

    Audio replay attack detection with deep learning frameworks,

    G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Proc. Interspeech 2017, 2017, pp. 82–86. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-360

  11. [11]

    Detecting spoofing attacks using vgg and sincnet: but-omilia submission to asvspoof 2019 challenge,

    H. Zeinali, T. Stafylakis, G. Athanasopoulou, J. Rohdin, I. Gkinis, L. Burget, and J. Cernocky, “Detecting spoofing attacks using vgg and sincnet: but-omilia submission to asvspoof 2019 challenge,” in Proc. Interspeech 2019, 2019

  12. [12]

    X-vectors: Robust DNN embeddings for speaker recognition,

    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333

  13. [13]

    How to improve your speaker embeddings extractor in generic toolkits,

    H. Zeinali, L. Burget, J. Rohdin, T. Stafylakis, and J. H. Cernocky, “How to improve your speaker embeddings extractor in generic toolkits,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6141–6145

  14. [14]

    Convolutional Neural Networks and x-vector Embedding for DCASE2018 Acoustic Scene Classification Challenge

    H. Zeinali, L. Burget, and J. Cernocky, “Convolutional neural networks and x-vector embedding for DCASE2018 acoustic scene classification challenge,” arXiv preprint arXiv:1810.04273, 2018

  15. [15]

    Self-attentive speaker embeddings for text-independent speaker verification,

    Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, “Self-attentive speaker embeddings for text-independent speaker verification,” Proc. Interspeech 2018, pp. 3573–3577, 2018

  16. [16]

    Attentive Statistics Pooling for Deep Speaker Embedding

    K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” arXiv preprint arXiv:1803.10963, 2018

  17. [17]

    Attention-based models for text-dependent speaker verification,

    F. R. rahman Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, “Attention-based models for text-dependent speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5359–5363

  18. [18]

    A multi-device dataset for urban acoustic scene classification,

    A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” in IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2018

  19. [19]

    librosa: Audio and music signal analysis in python,

    B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” in Proceedings of the 14th python in science conference, 2015, pp. 18–25

  20. [20]

    Acoustic scene classification with fully convolutional neural networks and I-vectors,

    M. Dorfer, B. Lehner, H. Eghbal-zadeh, H. Christop, P. Fabian, and W. Gerhard, “Acoustic scene classification with fully convolutional neural networks and I-vectors,” DCASE2018 Challenge, Tech. Rep., September 2018

  21. [21]

    FoCal multi-class: Toolkit for evaluation, fusion and calibration of multi-class recognition scores: tutorial and user manual,

    N. Brümmer, “FoCal multi-class: Toolkit for evaluation, fusion and calibration of multi-class recognition scores: tutorial and user manual,” Software available at http://sites.google.com/site/nikobrummer/focalmulticlass, 2007

  22. [22]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014