
arxiv: 1907.06639 · v1 · pith:3AECMOJCnew · submitted 2019-07-15 · 📡 eess.AS · cs.LG · cs.SD

Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling

Pith reviewed 2026-05-24 21:26 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD
keywords acoustic scene classification · data augmentation · generative adversarial networks · convolutional neural networks · model fusion · scalogram features · Mel filter banks

The pith

Integrating generative adversarial network data augmentation with convolutional neural network classifiers and fusion strategies achieves over 85 percent accuracy in acoustic scene classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a data augmentation approach based on generative adversarial networks can be combined with two primary neural network classifiers—one using scalogram features in a 1D deep convolutional network and the other using Mel filter bank features in a 2D fully convolutional network—to improve acoustic scene classification. Additional techniques including city adaptation, discrete cosine transform temporal modules, and hybrid architectures are incorporated before final fusion of multiple systems. A sympathetic reader would care because limited labeled audio data often constrains performance in sound environment recognition tasks, and successful augmentation plus fusion offers a practical route to higher accuracy without requiring vastly more real recordings.

Core claim

The authors show that their generative adversarial network data augmentation scheme, when paired with the listed classifiers and supporting modules and then fused across systems A through D, produces acoustic scene classification accuracy higher than 85 percent on the provided fold 1 evaluation dataset.

What carries the argument

The generative adversarial network data augmentation scheme integrated with 1D and 2D convolutional classifiers plus fusion of multiple variants including adaptation and temporal modules.

If this is right

  • Fusion across multiple classifier architectures and auxiliary modules produces measurable gains in classification accuracy.
  • Both scalogram-based 1D networks and Mel-based 2D networks contribute usefully when augmented by the same generative scheme.
  • City adaptation and discrete cosine transform temporal modules can be added as compatible enhancements before fusion.
  • The overall pipeline reaches accuracy above 85 percent on the stated evaluation set.
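The fusion step can be illustrated with a minimal sketch. The ensemble excerpt of the report names two voting strategies, average and weighted voting, but the excerpt is truncated before it says how the weights are chosen. The function names, the example scores, and the weights below are therefore hypothetical, and this is not the authors' implementation.

```python
from typing import List, Sequence


def average_voting(probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class scores of all systems with equal weight."""
    n_systems = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_systems for c in range(n_classes)]


def weighted_voting(probs: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Weighted mean of per-class scores; the weights are an assumption here."""
    total = sum(weights)
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) / total
        for c in range(n_classes)
    ]


def predict(scores: Sequence[float]) -> int:
    """Return the index of the highest fused score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical class scores from two systems over three scene classes.
system_a = [0.6, 0.3, 0.1]
system_b = [0.1, 0.7, 0.2]

avg_choice = predict(average_voting([system_a, system_b]))
weighted_choice = predict(weighted_voting([system_a, system_b], [3.0, 1.0]))
```

With these made-up scores the two strategies disagree: equal weighting favours class 1, while trusting the first system three times as much favours class 0. This is the kind of contested sample on which the choice of fusion rule matters.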

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same augmentation-plus-fusion pattern could be tested on other audio tasks that suffer from scarce labeled data.
  • Performance differences between the scalogram and Mel feature streams might reveal which acoustic properties matter most for scene discrimination.
  • Repeating the experiments on later evaluation folds or entirely new recording conditions would test whether the accuracy holds beyond the reported dataset.

Load-bearing premise

The generative adversarial network augmentation, the chosen classifiers, and the fusion approach can be combined and trained without overfitting or other failure modes on the evaluation data.

What would settle it

Running the described fusion systems A-D on the fold 1 evaluation dataset and obtaining accuracy at or below 85 percent would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 1907.06639 by Hangting Chen, Pengyuan Zhang, Yonghong Yan, Zongming Liu, Zuozhen Liu.

Figure 1: (a) ACGAN and (b) CVAE/ACGAN architecture for data augmentation.
Figure 2: The scheme of data augmentation. $L_{\mathrm{ACGAN}} = L_{\mathrm{real/fake}} + \gamma L_{\mathrm{scene}}$ (3), where γ controls the ratio between the real/fake loss and the scene classification loss. The noise used to generate fake features is usually assumed to follow a Gaussian distribution, which actually makes little sense. To obtain samples as real as possible, a CVAE/ACGAN architecture (Figure 1(b)) is deployed as an alternative to ACGAN. It uses encode…
Figure 3: Inception module I network architecture.
Figure 4: Inception module II network architecture.
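The ACGAN objective quoted in the Figure 2 caption text, a real/fake loss plus γ times a scene classification loss, can be sketched as follows. The excerpt does not define the two terms, so this sketch assumes binary cross-entropy for the real/fake term and categorical cross-entropy for the scene term. All function and parameter names are hypothetical.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0)


def real_fake_loss(p_real: float, is_real: bool) -> float:
    """Binary cross-entropy on the discriminator's 'real' probability (assumed form)."""
    p = min(max(p_real, EPS), 1.0 - EPS)
    return -math.log(p) if is_real else -math.log(1.0 - p)


def scene_loss(class_probs: Sequence[float], label: int) -> float:
    """Categorical cross-entropy on the scene posterior (assumed form)."""
    return -math.log(max(class_probs[label], EPS))


def acgan_loss(p_real: float, is_real: bool,
               class_probs: Sequence[float], label: int,
               gamma: float) -> float:
    """L_ACGAN = L_real/fake + gamma * L_scene for a single sample."""
    return real_fake_loss(p_real, is_real) + gamma * scene_loss(class_probs, label)


# A maximally uncertain discriminator on a two-class toy problem.
loss = acgan_loss(p_real=0.5, is_real=True, class_probs=[0.5, 0.5], label=0, gamma=2.0)
```

Setting `gamma` to 0 reduces the objective to the plain real/fake term, while larger values push the model to keep generated features scene-discriminative.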
Original abstract

This technical report describes the IOA team's submission for TASK1A of DCASE2019 challenge. Our acoustic scene classification (ASC) system adopts a data augmentation scheme employing generative adversary networks. Two major classifiers, 1D deep convolutional neural network integrated with scalogram features and 2D fully convolutional neural network integrated with Mel filter bank features, are deployed in the scheme. Other approaches, such as adversary city adaptation, temporal module based on discrete cosine transform and hybrid architectures, have been developed for further fusion. The results of our experiments indicates that the final fusion systems A-D could achieve an accuracy higher than 85% on the officially provided fold 1 evaluation dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a technical report describing the IOA team's submission to DCASE 2019 Task 1A acoustic scene classification. It integrates a GAN-based data augmentation scheme with two primary classifiers (1D deep CNN on scalogram features and 2D fully convolutional network on Mel filter bank features), plus auxiliary techniques including adversarial city adaptation, DCT-based temporal modules, and hybrid architectures. The central empirical claim is that the final fusion systems A-D achieve accuracy higher than 85% on the officially provided fold-1 evaluation dataset.

Significance. If the reported accuracies are accurate and the systems are reproducible, the work would provide a competitive DCASE 2019 entry demonstrating that GAN augmentation combined with multi-classifier fusion can exceed 85% on the challenge fold-1 set. This would offer a practical data point for the community on the utility of generative augmentation in ASC, though the lack of methodological detail limits assessment of whether the gains are robust or generalizable beyond the specific submission.

major comments (1)
  1. [Abstract] Abstract: The central claim that fusion systems A-D exceed 85% accuracy is presented with no accompanying details on training procedure, optimizer settings, data augmentation parameters, validation splits, number of runs, or error bars. This information is load-bearing for verifying an empirical performance claim in an experimental paper.
minor comments (2)
  1. [Abstract] Abstract: 'The results of our experiments indicates' contains a subject-verb agreement error and should read 'indicate'.
  2. [Abstract] Abstract: 'adversary city adaptation' appears to be a typographical error for the standard term 'adversarial city adaptation'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review of our DCASE 2019 technical report. We address the single major comment below.

  1. Referee: [Abstract] Abstract: The central claim that fusion systems A-D exceed 85% accuracy is presented with no accompanying details on training procedure, optimizer settings, data augmentation parameters, validation splits, number of runs, or error bars. This information is load-bearing for verifying an empirical performance claim in an experimental paper.

    Authors: We agree that the abstract, as written, is too concise to allow verification of the >85% claim. The manuscript body describes the GAN augmentation scheme, the 1D CNN on scalograms, the 2D FCN on Mel features, the city-adversarial adaptation, DCT temporal modules, and the four fusion systems, but does not supply the requested hyper-parameter list, optimizer settings, or statistical details. Because this information is absent from the current version, we will revise the abstract (and, if space permits, add a short experimental-setup paragraph) to include the key training and evaluation parameters.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical report


The paper is a DCASE technical report whose central claim consists solely of measured classification accuracies on the official fold-1 evaluation set after applying GAN augmentation and various neural classifiers. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the manuscript. The reported performance is externally verifiable against the challenge data and does not reduce to any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no mathematical structure, free parameters, or invented entities are described.



Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

  1. [1]

    INTRODUCTION Acoustic scene classification (ASC) aims to classify sounds into one of predefined classes [1]. Detection and Classification of Acoustic Scenes and Events (DCASE) challenges organized by IEEE Audio and Signal Processing (AASP) Technical Committee are one of the biggest competitions for ASC task. The large-scale dataset provided by DCASE2019 pres...

  2. [2]

    Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling

    THE DATA AUGMENTATION SCHEME Figure 1: (a) ACGAN and (b) CVAE/ACGAN architecture for data augmentation. Though most ASC systems can accurately classify training samples, they suffer from inferring test records, especially those from unseen cities [2][3]. To improve generalization, additional samples are generated and added into the database. Inspired by ...

  3. [3]

    FBank-FCNN Classifier The FBank-FCNN network architecture is shown in Table 1, similar to the one proposed in [7]

    CLASSIFICATION SYSTEMS 3.1. FBank-FCNN Classifier The FBank-FCNN network architecture is shown in Table 1, similar to the one proposed in [7]. It is a VGG [8] style network with 10 repeatedly stacked convolution layers containing small convolutional kernels. The common techniques in deep learning, such as batch normalization, dropout and Rectified Linea...

  4. [4]

    Model ensemble can stabilize and generalize our final results

    ENSEMBLE METHODOLOGY Different classifiers may lead to divisions among some controversial samples. Model ensemble can stabilize and generalize our final results. In the practice, voting serves as a simple and effective method compared with support vector machine, regression, etc. Two strategies, average and weighted voting, are adopted. The latter’s...

  5. [5]

    Then the systems were retrained on the whole development data for submission

    EXPERIMENTS We used the officially provided fold 1 procedure to evaluate our systems’ performance. Then the systems were retrained on the whole development data for submission. The train set was firstly split into the train and validate set. The classifiers were trained on the train set in maximum 200 epochs. The validation set determined the early stoppin...

  6. [6]

    For data augmentation, the generator and the discriminator were simplified versions of the classifier in Section 3.1

    The Fbank of a 10-second stereo audio was of the dimension 500 × 2 × 128 before and 500 × 6 × 768 after adding delta and delta-delta coefficients. For data augmentation, the generator and the discriminator were simplified versions of the classifier in Section 3.1. 5.3. Scalogram-DCNN experiments To extract the scalogram, STFT was applied on the raw signal ev- ...

  7. [7]

    RESULTS AND DISCUSSION Table 3: Results of experiments of different data augmentation frameworks on the fold 1 evaluation set, where the best performance is in bold.

    Feature type   Channels     Data augmentation scheme   Accuracy (%)
    FBank          Left-Right   w/o                        76.92
    FBank          Left-Right   ACGAN                      77.56
    FBank          Ave-Diff     w/o                        79.95
    FBank          Ave-Diff     ACGAN                      80.10
    Scalogram      Left-Right   w/o                        ...

  8. [8]

    CONCLUSION This report describes our submissions for DCASE2019 Task1A. In the fold 1 evaluation setup, the data augmentation scheme integrated with convolutional neural networks and other training architectures achieved accuracies above 77% and 83% for FBank and scalogram. After voting fusion, the final systems could achieve accuracies above 85%. Dete...

  9. [9]

    TUT database for acoustic scene classification and sound event detection,

    A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” pp. 1128–1132, 2016

  10. [10]

    DCASE2017 challenge setup: Tasks, datasets and baseline system,

    A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE2017 challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017) , November 2017, pp. 85–92

  11. [11]

    A multi-device dataset for urban acoustic scene classification,

    A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), November 2018, pp. 9–13

  12. [12]

    Conditional image synthesis with auxiliary classifier GANs,

    A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier GANs,” in Proceedings of the 34th International Conference on Machine Learning-Volume

  13. [13]

    org, 2017, pp

    JMLR.org, 2017, pp. 2642–2651

  14. [14]

    Autoencoding beyond pixels using a learned similarity metric,

    A. B. L. Larsen, S. K. Sonderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” International Conference on Machine Learning, pp. 1558–1566, 2016

  15. [15]

    Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane,

    S. Mun, S. Park, D. Han, and H. Ko, “Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane,” DCASE2017 Challenge, Tech. Rep., September 2017

  16. [16]

    Acoustic scene classification with fully convolutional neural networks and I-vectors,

    M. Dorfer, B. Lehner, H. Eghbal-zadeh, H. Christop, P. Fabian, and W. Gerhard, “Acoustic scene classification with fully convolutional neural networks and I-vectors,” DCASE2018 Challenge, Tech. Rep., September 2018

  17. [17]

    Very deep convolutional networks for large-scale image recognition,

    K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” International Conference on Learning Representations, 2015

  18. [18]

    An audio scene classification framework with embedded filters and a dct-based temporal module,

    H. Chen, P. Zhang, and Y. Yan, “An audio scene classification framework with embedded filters and a dct-based temporal module,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 835–839

  19. [19]

    Deep convolutional neural network with scalogram for audio scene modeling

    H. Chen, P. Zhang, H. Bai, Q. Yuan, X. Bao, and Y. Yan, “Deep convolutional neural network with scalogram for audio scene modeling,” in Interspeech, 2018, pp. 3304–3308

  20. [20]

    Adversarial discriminative domain adaptation,

    E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” Computer Vision and Pattern Recognition, pp. 2962–2971, 2017

  21. [21]

    Going deeper with convolutions,

    C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9

  22. [22]

    Rethinking the inception architecture for computer vision,

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826

  23. [23]

    Inception-v4, inception-resnet and the impact of residual connections on learning,

    C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017

  24. [24]

    Deep Scattering Spectrum

    J. Andén and S. Mallat, “Deep scattering spectrum,” CoRR, vol. abs/1304.6763, 2013. [Online]. Available: http://arxiv.org/abs/1304.6763
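As a rough check on the Table 3 excerpt quoted in the reference graph above, the gain attributed to ACGAN augmentation can be tallied from the four complete FBank rows. The dictionary layout and helper name are hypothetical; the accuracy values are copied from the excerpt, and the truncated scalogram rows are left out rather than guessed.

```python
# Fold 1 accuracies (%) from the visible FBank rows of Table 3.
results = {
    ("FBank", "Left-Right"): {"w/o": 76.92, "ACGAN": 77.56},
    ("FBank", "Ave-Diff"): {"w/o": 79.95, "ACGAN": 80.10},
}


def augmentation_gain(row: dict) -> float:
    """Absolute accuracy gain, in percentage points, from adding ACGAN augmentation."""
    return round(row["ACGAN"] - row["w/o"], 2)


gains = {key: augmentation_gain(row) for key, row in results.items()}

for (feature, channels), gain in gains.items():
    print(f"{feature} {channels}: +{gain:.2f} points")
```

On these rows the augmentation gain is under one percentage point in both channel configurations, which is smaller than the difference between the Left-Right and Ave-Diff channel configurations themselves.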