Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling

Hangting Chen; Pengyuan Zhang; Yonghong Yan; Zongming Liu; Zuozhen Liu

arxiv: 1907.06639 · v1 · pith:3AECMOJCnew · submitted 2019-07-15 · 📡 eess.AS · cs.LG· cs.SD

Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling

Hangting Chen , Zuozhen Liu , Zongming Liu , Pengyuan Zhang , Yonghong Yan This is my paper

Pith reviewed 2026-05-24 21:26 UTC · model grok-4.3

classification 📡 eess.AS cs.LGcs.SD

keywords acoustic scene classificationdata augmentationgenerative adversarial networksconvolutional neural networksmodel fusionscalogram featuresMel filter banks

0 comments

The pith

Integrating generative adversarial network data augmentation with convolutional neural network classifiers and fusion strategies achieves over 85 percent accuracy in acoustic scene classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a data augmentation approach based on generative adversarial networks can be combined with two primary neural network classifiers—one using scalogram features in a 1D deep convolutional network and the other using Mel filter bank features in a 2D fully convolutional network—to improve acoustic scene classification. Additional techniques including city adaptation, discrete cosine transform temporal modules, and hybrid architectures are incorporated before final fusion of multiple systems. A sympathetic reader would care because limited labeled audio data often constrains performance in sound environment recognition tasks, and successful augmentation plus fusion offers a practical route to higher accuracy without requiring vastly more real recordings.

Core claim

The authors show that their generative adversarial network data augmentation scheme, when paired with the listed classifiers and supporting modules and then fused across systems A through D, produces acoustic scene classification accuracy higher than 85 percent on the provided fold 1 evaluation dataset.

What carries the argument

The generative adversarial network data augmentation scheme integrated with 1D and 2D convolutional classifiers plus fusion of multiple variants including adaptation and temporal modules.

If this is right

Fusion across multiple classifier architectures and auxiliary modules produces measurable gains in classification accuracy.
Both scalogram-based 1D networks and Mel-based 2D networks contribute usefully when augmented by the same generative scheme.
City adaptation and discrete cosine transform temporal modules can be added as compatible enhancements before fusion.
The overall pipeline reaches accuracy above 85 percent on the stated evaluation set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same augmentation-plus-fusion pattern could be tested on other audio tasks that suffer from scarce labeled data.
Performance differences between the scalogram and Mel feature streams might reveal which acoustic properties matter most for scene discrimination.
Repeating the experiments on later evaluation folds or entirely new recording conditions would test whether the accuracy holds beyond the reported dataset.

Load-bearing premise

The generative adversarial network augmentation, the chosen classifiers, and the fusion approach can be combined and trained without overfitting or other failure modes on the evaluation data.

What would settle it

Running the described fusion systems A-D on the fold 1 evaluation dataset and obtaining accuracy at or below 85 percent would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 1907.06639 by Hangting Chen, Pengyuan Zhang, Yonghong Yan, Zongming Liu, Zuozhen Liu.

**Figure 2.** Figure 2: The scheme of data augmentation. LACGAN = Lreal/fake + γLscene, (3) where γ controls the ratio between real/fake loss and scene classification loss. The noise to generate fake features is usually assumed following Gaussian distribution, which actually makes little sense. To obtain samples as real as possible, a CVAE/ACGAN architecture (Figure 1(b)) is deployed as an alternative to ACGAN. It uses encode… view at source ↗

**Figure 4.** Figure 4: Inception module II network architecture. [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 3.** Figure 3: Inception module I network architecture. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

read the original abstract

This technical report describes the IOA team's submission for TASK1A of DCASE2019 challenge. Our acoustic scene classification (ASC) system adopts a data augmentation scheme employing generative adversary networks. Two major classifiers, 1D deep convolutional neural network integrated with scalogram features and 2D fully convolutional neural network integrated with Mel filter bank features, are deployed in the scheme. Other approaches, such as adversary city adaptation, temporal module based on discrete cosine transform and hybrid architectures, have been developed for further fusion. The results of our experiments indicates that the final fusion systems A-D could achieve an accuracy higher than 85% on the officially provided fold 1 evaluation dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a standard DCASE challenge report that fuses GAN augmentation with off-the-shelf CNNs to hit >85% on fold 1, but adds no new methods or analysis.

read the letter

This paper is the IOA team's DCASE 2019 task 1A submission report. They apply a previously published GAN data augmentation scheme, run a 1D CNN on scalogram features and a 2D FCN on Mel features, add a few extras like city adaptation and a DCT temporal module, then fuse the outputs. The headline result is that the fused systems reach over 85% accuracy on the official fold-1 evaluation set. That number is the only concrete claim. The work does what challenge reports are supposed to do: it documents one team's system and its score on the benchmark. For other participants who want to know what produced a competitive entry, the description of the components is straightforward and usable at a high level. The soft spots are exactly what the abstract shows. There are no training details, no validation protocol, no ablation results, no error bars, and no discussion of how the models were kept from overfitting the evaluation data. The paper is purely empirical with no derivations or new framework. Because the full text follows the same pattern as the abstract, the 85% figure cannot be checked or reproduced from what is written. This kind of report is mainly for people already inside the DCASE acoustic scene track who need to see submitted systems. It does not reorganize any part of the field or give a reader outside the challenge anything they can build on. I would not bring it to a reading group, would not cite it, and would not send it for peer review. It belongs as a technical report in the challenge proceedings, not as a journal paper.

Referee Report

1 major / 2 minor

Summary. The manuscript is a technical report describing the IOA team's submission to DCASE 2019 Task 1A acoustic scene classification. It integrates a GAN-based data augmentation scheme with two primary classifiers (1D deep CNN on scalogram features and 2D fully convolutional network on Mel filter bank features), plus auxiliary techniques including adversarial city adaptation, DCT-based temporal modules, and hybrid architectures. The central empirical claim is that the final fusion systems A-D achieve accuracy higher than 85% on the officially provided fold-1 evaluation dataset.

Significance. If the reported accuracies are accurate and the systems are reproducible, the work would provide a competitive DCASE 2019 entry demonstrating that GAN augmentation combined with multi-classifier fusion can exceed 85% on the challenge fold-1 set. This would offer a practical data point for the community on the utility of generative augmentation in ASC, though the lack of methodological detail limits assessment of whether the gains are robust or generalizable beyond the specific submission.

major comments (1)

[Abstract] Abstract: The central claim that fusion systems A-D exceed 85% accuracy is presented with no accompanying details on training procedure, optimizer settings, data augmentation parameters, validation splits, number of runs, or error bars. This information is load-bearing for verifying an empirical performance claim in an experimental paper.

minor comments (2)

[Abstract] Abstract: 'The results of our experiments indicates' contains a subject-verb agreement error and should read 'indicate'.
[Abstract] Abstract: 'adversary city adaptation' appears to be a typographical error for the standard term 'adversarial city adaptation'.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review of our DCASE 2019 technical report. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that fusion systems A-D exceed 85% accuracy is presented with no accompanying details on training procedure, optimizer settings, data augmentation parameters, validation splits, number of runs, or error bars. This information is load-bearing for verifying an empirical performance claim in an experimental paper.

Authors: We agree that the abstract, as written, is too concise to allow verification of the >85% claim. The manuscript body describes the GAN augmentation scheme, the 1D CNN on scalograms, the 2D FCN on Mel features, the city-adversarial adaptation, DCT temporal modules, and the four fusion systems, but does not supply the requested hyper-parameter list, optimizer settings, or statistical details. Because this information is absent from the current version, we will revise the abstract (and, if space permits, add a short experimental-setup paragraph) to include the key training and evaluation parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical report

full rationale

The paper is a DCASE technical report whose central claim consists solely of measured classification accuracies on the official fold-1 evaluation set after applying GAN augmentation and various neural classifiers. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the manuscript. The reported performance is externally verifiable against the challenge data and does not reduce to any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no mathematical structure, free parameters, or invented entities are described.

pith-pipeline@v0.9.0 · 5654 in / 956 out tokens · 29966 ms · 2026-05-24T21:26:06.258749+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

[1]

INTRODUCTION Acoustic scene classiﬁcation (ASC) aims to classify sounds into one of predeﬁned classes [1]. Detection and Classiﬁcation of Acoustic Scenes and Events (DCASE) challenges organized by IEEE Audio and Signal Processing (AASP) Technical Committee are one of the biggest competitions for ASC task. The large-scale dataset provided by DCASE2019 pres...

work page
[2]

Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling

THE DATA AUGMENTATION SCHEME Figure 1: (a) ACGAN and (b) CV AE/ACGAN architecture for data augmentation. Though most ASC systems can accurately classify training samples, they suffer from inferring test records, especially those from unseen cities [2][3]. To improve generalization, additional samples are generated and added into the database. Inspired by ...

work page internal anchor Pith review Pith/arXiv arXiv 1907
[3]

FBank-FCNN Classiﬁer The FBank-FCNN network architecture is shown in Table 1, simi- lar to the one proposed in [7]

CLASSIFICATION SYSTEMS 3.1. FBank-FCNN Classiﬁer The FBank-FCNN network architecture is shown in Table 1, simi- lar to the one proposed in [7]. It is a VGG [8] style Network with 10 repeatedly stacked convolution layers containing small convolu- tional kernels. The common techniques in deep learning, such as batch normalization, dropout and Rectiﬁed Linea...

work page 2019
[4]

Model ensemble can stabilize and generalize our ﬁ- nal results

ENSEMBLE METHODOLOGY Different classiﬁers may lead to divisions among some controver- sial samples. Model ensemble can stabilize and generalize our ﬁ- nal results. In the practice, voting serves as a simple and effec- tive method compared with support vector machine, regression, etc. Two strategies, average and weighted voting, are adopted. The lat- ter’s...

work page
[5]

Then the systems were retrained on the whole development data for submission

EXPERIMENTS We used the ofﬁcially provided fold 1 procedure to evaluate our sys- tems’ performance. Then the systems were retrained on the whole development data for submission. The train set was ﬁrstly split into the train and validate set. The classiﬁers were trained on the train set in maximum 200 epochs. The validation set determined the early stoppin...

work page 2019
[6]

For data augmentation, the generator and the discriminator were simpliﬁed versions of the classiﬁer in Section 3.1

The Fbank of a 10-second stereo audio was of the dimen- sion 500× 2× 128 before and 500× 6× 768 after adding delta and delta-delta coefﬁcients. For data augmentation, the generator and the discriminator were simpliﬁed versions of the classiﬁer in Section 3.1. 5.3. Scalogram-DCNN experiments To extract the scalogram, STFT was applied on the raw signal ev- ...

work page
[7]

RESULTS AND DISCUSSION Table 3: Results of experiments of different data augmentation frameworks on the fold 1 evaluation set, where the best performance is in bold. Feature type Channels Data augmentation scheme Accuracy(%) Fbank Left-Right w/o 76.92 Fbank Left-Right ACGAN 77.56 FBank Ave-Diff w/o 79.95 FBank Ave-Diff ACGAN 80.10 Scalogram Left-Dight w/o...

work page
[8]

CONCLUSION This report describes our submissions for DCASE2019 Task1A. In the fold 1 evaluation setup, the data augmentation scheme in- tegrated with convolutional neural networks and other training ar- chitectures achieved accuracies above77% and 83% for FBank and scalogram. After voting fusion, the ﬁnal systems could achieve ac- curacies above 85%. Dete...

work page 2019
[9]

Tut database for acoustic scene classiﬁcation and sound event detection,

A. Mesaros, T. Heittola, and T. Virtanen, “Tut database for acoustic scene classiﬁcation and sound event detection,” pp. 1128–1132, 2016

work page 2016
[10]

DCASE2017 challenge setup: Tasks, datasets and baseline system,

A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE2017 challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classiﬁcation of Acoustic Scenes and Events 2017 Workshop (DCASE2017) , November 2017, pp. 85–92

work page 2017
[11]

A multi-device dataset for urban acoustic scene classiﬁcation,

A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classiﬁcation,” in Proceed- ings of the Detection and Classiﬁcation of Acoustic Scenes and Events 2018 Workshop (DCASE2018) , November 2018, pp. 9–13

work page 2018
[12]

Conditional image syn- thesis with auxiliary classiﬁer gans,

A. Odena, C. Olah, and J. Shlens, “Conditional image syn- thesis with auxiliary classiﬁer gans,” in Proceedings of the 34th International Conference on Machine Learning-Volume

work page
[13]

org, 2017, pp

JMLR. org, 2017, pp. 2642–2651

work page 2017
[14]

Autoencoding beyond pixels using a learned sim- ilarity metric,

A. B. L. Larsen, S. K. Sonderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned sim- ilarity metric,” international conference on machine learning, pp. 1558–1566, 2016

work page 2016
[15]

Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane,

S. Mun, S. Park, D. Han, and H. Ko, “Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane,” DCASE2017 Challenge, Tech. Rep., September 2017

work page 2017
[16]

Acoustic scene classiﬁcation with fully convolutional neural networks and I-vectors,

M. Dorfer, B. Lehner, H. Eghbal-zadeh, H. Christop, P. Fabian, and W. Gerhard, “Acoustic scene classiﬁcation with fully convolutional neural networks and I-vectors,” DCASE2018 Challenge, Tech. Rep., September 2018

work page 2018
[17]

Very deep convolutional net- works for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional net- works for large-scale image recognition,” international con- ference on learning representations, 2015

work page 2015
[18]

An audio scene classiﬁcation framework with embedded ﬁlters and a dct-based temporal module,

H. Chen, P. Zhang, and Y . Yan, “An audio scene classiﬁcation framework with embedded ﬁlters and a dct-based temporal module,” in ICASSP 2019-2019 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2019, pp. 835–839

work page 2019
[19]

Deep convolutional neural network with scalogram for audio scene modeling

H. Chen, P. Zhang, H. Bai, Q. Yuan, X. Bao, and Y . Yan, “Deep convolutional neural network with scalogram for audio scene modeling.” in Interspeech, 2018, pp. 3304–3308

work page 2018
[20]

Adversarial discriminative domain adaptation,

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” computer vision and pat- tern recognition, pp. 2962–2971, 2017

work page 2017
[21]

Going deeper with convolutions,

C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9

work page 2015
[22]

Rethinking the inception architecture for computer vision,

C. Szegedy, V . Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826

work page 2016
[23]

Inception-v4, inception-resnet and the impact of residual connections on learning,

C. Szegedy, S. Ioffe, V . Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-First AAAI Conference on Artiﬁcial Intelligence, 2017

work page 2017
[24]

Deep Scattering Spectrum

J. And ´en and S. Mallat, “Deep scattering spectrum,” CoRR, vol. abs/1304.6763, 2013. [Online]. Available: http: //arxiv.org/abs/1304.6763

work page internal anchor Pith review Pith/arXiv arXiv 2013

[1] [1]

INTRODUCTION Acoustic scene classiﬁcation (ASC) aims to classify sounds into one of predeﬁned classes [1]. Detection and Classiﬁcation of Acoustic Scenes and Events (DCASE) challenges organized by IEEE Audio and Signal Processing (AASP) Technical Committee are one of the biggest competitions for ASC task. The large-scale dataset provided by DCASE2019 pres...

work page

[2] [2]

Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling

THE DATA AUGMENTATION SCHEME Figure 1: (a) ACGAN and (b) CV AE/ACGAN architecture for data augmentation. Though most ASC systems can accurately classify training samples, they suffer from inferring test records, especially those from unseen cities [2][3]. To improve generalization, additional samples are generated and added into the database. Inspired by ...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[3] [3]

FBank-FCNN Classiﬁer The FBank-FCNN network architecture is shown in Table 1, simi- lar to the one proposed in [7]

CLASSIFICATION SYSTEMS 3.1. FBank-FCNN Classiﬁer The FBank-FCNN network architecture is shown in Table 1, simi- lar to the one proposed in [7]. It is a VGG [8] style Network with 10 repeatedly stacked convolution layers containing small convolu- tional kernels. The common techniques in deep learning, such as batch normalization, dropout and Rectiﬁed Linea...

work page 2019

[4] [4]

Model ensemble can stabilize and generalize our ﬁ- nal results

ENSEMBLE METHODOLOGY Different classiﬁers may lead to divisions among some controver- sial samples. Model ensemble can stabilize and generalize our ﬁ- nal results. In the practice, voting serves as a simple and effec- tive method compared with support vector machine, regression, etc. Two strategies, average and weighted voting, are adopted. The lat- ter’s...

work page

[5] [5]

Then the systems were retrained on the whole development data for submission

EXPERIMENTS We used the ofﬁcially provided fold 1 procedure to evaluate our sys- tems’ performance. Then the systems were retrained on the whole development data for submission. The train set was ﬁrstly split into the train and validate set. The classiﬁers were trained on the train set in maximum 200 epochs. The validation set determined the early stoppin...

work page 2019

[6] [6]

For data augmentation, the generator and the discriminator were simpliﬁed versions of the classiﬁer in Section 3.1

The Fbank of a 10-second stereo audio was of the dimen- sion 500× 2× 128 before and 500× 6× 768 after adding delta and delta-delta coefﬁcients. For data augmentation, the generator and the discriminator were simpliﬁed versions of the classiﬁer in Section 3.1. 5.3. Scalogram-DCNN experiments To extract the scalogram, STFT was applied on the raw signal ev- ...

work page

[7] [7]

RESULTS AND DISCUSSION Table 3: Results of experiments of different data augmentation frameworks on the fold 1 evaluation set, where the best performance is in bold. Feature type Channels Data augmentation scheme Accuracy(%) Fbank Left-Right w/o 76.92 Fbank Left-Right ACGAN 77.56 FBank Ave-Diff w/o 79.95 FBank Ave-Diff ACGAN 80.10 Scalogram Left-Dight w/o...

work page

[8] [8]

CONCLUSION This report describes our submissions for DCASE2019 Task1A. In the fold 1 evaluation setup, the data augmentation scheme in- tegrated with convolutional neural networks and other training ar- chitectures achieved accuracies above77% and 83% for FBank and scalogram. After voting fusion, the ﬁnal systems could achieve ac- curacies above 85%. Dete...

work page 2019

[9] [9]

Tut database for acoustic scene classiﬁcation and sound event detection,

A. Mesaros, T. Heittola, and T. Virtanen, “Tut database for acoustic scene classiﬁcation and sound event detection,” pp. 1128–1132, 2016

work page 2016

[10] [10]

DCASE2017 challenge setup: Tasks, datasets and baseline system,

A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE2017 challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classiﬁcation of Acoustic Scenes and Events 2017 Workshop (DCASE2017) , November 2017, pp. 85–92

work page 2017

[11] [11]

A multi-device dataset for urban acoustic scene classiﬁcation,

A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classiﬁcation,” in Proceed- ings of the Detection and Classiﬁcation of Acoustic Scenes and Events 2018 Workshop (DCASE2018) , November 2018, pp. 9–13

work page 2018

[12] [12]

Conditional image syn- thesis with auxiliary classiﬁer gans,

A. Odena, C. Olah, and J. Shlens, “Conditional image syn- thesis with auxiliary classiﬁer gans,” in Proceedings of the 34th International Conference on Machine Learning-Volume

work page

[13] [13]

org, 2017, pp

JMLR. org, 2017, pp. 2642–2651

work page 2017

[14] [14]

Autoencoding beyond pixels using a learned sim- ilarity metric,

A. B. L. Larsen, S. K. Sonderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned sim- ilarity metric,” international conference on machine learning, pp. 1558–1566, 2016

work page 2016

[15] [15]

Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane,

S. Mun, S. Park, D. Han, and H. Ko, “Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane,” DCASE2017 Challenge, Tech. Rep., September 2017

work page 2017

[16] [16]

Acoustic scene classiﬁcation with fully convolutional neural networks and I-vectors,

M. Dorfer, B. Lehner, H. Eghbal-zadeh, H. Christop, P. Fabian, and W. Gerhard, “Acoustic scene classiﬁcation with fully convolutional neural networks and I-vectors,” DCASE2018 Challenge, Tech. Rep., September 2018

work page 2018

[17] [17]

Very deep convolutional net- works for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional net- works for large-scale image recognition,” international con- ference on learning representations, 2015

work page 2015

[18] [18]

An audio scene classiﬁcation framework with embedded ﬁlters and a dct-based temporal module,

H. Chen, P. Zhang, and Y . Yan, “An audio scene classiﬁcation framework with embedded ﬁlters and a dct-based temporal module,” in ICASSP 2019-2019 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2019, pp. 835–839

work page 2019

[19] [19]

Deep convolutional neural network with scalogram for audio scene modeling

H. Chen, P. Zhang, H. Bai, Q. Yuan, X. Bao, and Y . Yan, “Deep convolutional neural network with scalogram for audio scene modeling.” in Interspeech, 2018, pp. 3304–3308

work page 2018

[20] [20]

Adversarial discriminative domain adaptation,

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” computer vision and pat- tern recognition, pp. 2962–2971, 2017

work page 2017

[21] [21]

Going deeper with convolutions,

C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9

work page 2015

[22] [22]

Rethinking the inception architecture for computer vision,

C. Szegedy, V . Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826

work page 2016

[23] [23]

Inception-v4, inception-resnet and the impact of residual connections on learning,

C. Szegedy, S. Ioffe, V . Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-First AAAI Conference on Artiﬁcial Intelligence, 2017

work page 2017

[24] [24]

Deep Scattering Spectrum

J. And ´en and S. Mallat, “Deep scattering spectrum,” CoRR, vol. abs/1304.6763, 2013. [Online]. Available: http: //arxiv.org/abs/1304.6763

work page internal anchor Pith review Pith/arXiv arXiv 2013