YouTube-8M: A Large-Scale Video Classification Benchmark
Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no video classification datasets of comparable size. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities. To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics. While the labels are machine-generated, they have high precision and are derived from a variety of human-based signals including metadata and query click signals. We filtered the video labels (Knowledge Graph entities) using both automated and manual curation strategies, including asking human raters if the labels are visually recognizable. Then, we decoded each video at one frame per second, and used a Deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and made both the features and video-level labels available for download. We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report the results as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow. We plan to release code for training a TensorFlow model and for computing metrics.
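The abstract outlines a concrete feature pipeline: decode each video at one frame per second, take the activations immediately before an ImageNet-pretrained CNN's classification layer, then compress the per-frame features. The sketch below illustrates that pipeline under stated assumptions: tf.keras's InceptionV3 stands in for the paper's Inception network, the helper names (`frames_at_1fps`, `video_features`, `compress`) are hypothetical, and PCA plus 8-bit quantization is one plausible reading of "compressed the frame features", not necessarily the paper's exact scheme.

```python
# Illustrative sketch of the frame-feature pipeline described in the abstract.
# Assumptions (not confirmed by the abstract): InceptionV3 as the ImageNet CNN,
# and PCA + uint8 quantization as the compression scheme.
import cv2
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA

# ImageNet-pretrained CNN; include_top=False with pooling="avg" returns the
# 2048-d activations immediately prior to the classification layer.
cnn = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

def frames_at_1fps(path):
    """Yield RGB frames decoded at roughly one frame per second."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if fps is unknown
    step = max(int(round(fps)), 1)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            yield cv2.resize(rgb, (299, 299))  # InceptionV3 input size
        i += 1
    cap.release()

def video_features(path):
    """Return one 2048-d feature vector per sampled frame."""
    frames = np.stack(list(frames_at_1fps(path)))
    x = tf.keras.applications.inception_v3.preprocess_input(
        frames.astype(np.float32))
    return cnn.predict(x, verbose=0)  # shape: (num_frames, 2048)

def compress(features, pca):
    """Reduce dimensionality with a pre-fit PCA, then quantize to 8 bits."""
    reduced = pca.transform(features)
    lo, hi = reduced.min(), reduced.max()
    q = np.round(255.0 * (reduced - lo) / (hi - lo + 1e-8)).astype(np.uint8)
    return q, (lo, hi)  # keep the range so features can be dequantized
```

In practice the PCA would be fit once over frame features pooled across many videos (e.g. `PCA(n_components=1024).fit(pooled_features)`); quantizing to uint8 then cuts storage roughly 4x relative to float32, which matters at 500K hours of video.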
Forward citations
Cited by 8 Pith papers
- MusicLM: Generating Music From Text
  MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
  Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
  A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
- Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
  Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.
- DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
  A scalable, training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning.
- SS3D: End2End Self-Supervised 3D from Web Videos
  SS3D pretrains an end-to-end 3D estimator on filtered YouTube-8M videos via SfM self-supervision, achieving improved zero-shot transfer and fine-tuning over prior baselines.
- SS3D: End2End Self-Supervised 3D from Web Videos
  SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior baselines.
- SS3D: End2End Self-Supervised 3D from Web Videos
  SS3D scales SfM-based self-supervision to ~100M frames from YouTube-8M using a multi-view signal proxy for filtering and a two-stage training schedule, achieving strong zero-shot transfer and better fine-tuning than prior baselines.