CNN Architectures for Large-Scale Audio Classification

Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson

classification cs.SD cs.LG stat.ML

keywords classification, audio, training, architectures, cnns, image, label, networks

Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VideoPoet: A Large Language Model for Zero-Shot Video Generation
cs.CV 2023-12 unverdicted novelty 6.0

VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.