YouTube-8M: A Large-Scale Video Classification Benchmark

Balakrishnan Varadarajan; George Toderici; Joonseok Lee; Nisarg Kothari; Paul Natsev; Sami Abu-el-haija; Sudheendra Vijayanarasimhan

arxiv: 1609.08675 · v1 · pith:5MT2PO5Gnew · submitted 2016-09-27 · 💻 cs.CV

keywords: video, labels, classification, dataset, datasets, models, videos, entities

Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no comparable size video classification datasets. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities. To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics. While the labels are machine-generated, they have high-precision and are derived from a variety of human-based signals including metadata and query click signals. We filtered the video labels (Knowledge Graph entities) using both automated and manual curation strategies, including asking human raters if the labels are visually recognizable. Then, we decoded each video at one-frame-per-second, and used a Deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and make both the features and video-level labels available for download. We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report them as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow. We plan to release code for training a TensorFlow model and for computing metrics.
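
The sampling and compression steps in the abstract can be illustrated with a minimal sketch. It is not the authors' released code. The function names, the clipping range of [-2, 2], and the uniform 8-bit quantization are assumptions made for illustration only; the abstract says the frame features were compressed but does not say how.

```python
def one_fps_frame_indices(num_frames, native_fps):
    """Pick one frame index per whole second of video (hypothetical helper)."""
    whole_seconds = int(num_frames / native_fps)
    return [min(num_frames - 1, round(t * native_fps))
            for t in range(whole_seconds)]


def quantize_uint8(features, lo=-2.0, hi=2.0):
    """Clip each value to [lo, hi] and map it to an integer code in 0..255.

    The range and the 8-bit width are assumed, not taken from the paper.
    """
    scale = 255.0 / (hi - lo)
    return [int(round((min(max(x, lo), hi) - lo) * scale)) for x in features]


def dequantize_uint8(codes, lo=-2.0, hi=2.0):
    """Map integer codes back to approximate float values."""
    step = (hi - lo) / 255.0
    return [lo + c * step for c in codes]


if __name__ == "__main__":
    # A 3-second clip at 30 frames per second keeps three frames.
    print(one_fps_frame_indices(90, 30.0))
    # A toy 4-dimensional "frame feature" survives a round trip approximately.
    feature = [-1.3, 0.02, 0.7, 3.5]
    codes = quantize_uint8(feature)
    print(codes, dequantize_uint8(codes))
```

Under these assumptions each feature value is stored in one byte instead of four, and values outside the clipping range saturate at the nearest bound.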

This paper has not been read by Pith yet.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MusicLM: Generating Music From Text
cs.SD 2023-01 conditional novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
USV: Towards Understanding the User-generated Short-form Videos
cs.CV 2026-05 unverdicted novelty 7.0

Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.
SCP: Spatial Causal Prediction in Video
cs.CV 2026-03 unverdicted novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
Recurrent Video Masked Autoencoders
cs.CV 2025-12 unverdicted novelty 7.0

RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better pa...
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
cs.SD 2025-07 unverdicted novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
cs.AI 2024-07 accept novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
cs.CV 2023-10 unverdicted novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
The TIME Machine: On The Power of Motion for Efficient Perception
cs.CV 2026-05 unverdicted novelty 6.0

TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.
PipeANN-Filter: An Efficient Filtered Vector Search System on SSD
cs.OS 2026-05 unverdicted novelty 6.0

PipeANN-Filter improves filtered vector search latency and throughput on SSD by exploring a superset of valid vectors identified via probabilistic filters and verifying attributes only after selecting top-k candidates.
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
cs.CV 2026-04 unverdicted novelty 6.0

A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 6.0

SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior ...
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 6.0

SS3D pretrains an end-to-end 3D estimator on filtered YouTube-8M videos via SfM self-supervision, achieving improved zero-shot transfer and fine-tuning over prior baselines.
Structure Over Scale: Learning Visual Reasoning from Pedagogical Video
cs.CV 2026-01 unverdicted novelty 6.0

Fine-tuning VLMs on 10K QA pairs from pedagogical children's videos produces consistent gains on NExT-QA, Video-MME, and MotionBench, indicating that explicit structure can substitute for data scale.
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
cs.CV 2023-10 unverdicted novelty 6.0

LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
Tracking Temporal Evolution of Graphs using Non-Timestamped Data
cs.SI 2019-07 unverdicted novelty 6.0

Presents YoutubeGraph-Dyn, a multi-modal dynamic graph dataset from YouTube interactions with intra-day snapshots, and benchmarks clustering for community migration plus time series and RNN methods for forecasting non...
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 5.0

SS3D scales SfM-based self-supervision to ~100M frames from YouTube-8M using a multi-view signal proxy for filtering and a two-stage training schedule, achieving strong zero-shot transfer and better fine-tuning than p...
SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets
cs.CV 2025-10 conditional novelty 5.0

SD-MVSum extends script-driven video summarization to multimodal inputs by modeling script-video and script-transcript relevance with a new weighted cross-modal attention mechanism, plus extended S-VideoXum and MrHiSu...
KFC-W: Generating 3D-Consistent Videos from Unposed Internet Photos
cs.CV 2024-11 unverdicted novelty 5.0

KFC-W is a self-supervised 3D-aware video model trained on videos and multiview internet photos that produces geometrically consistent interpolations between unposed input images without any 3D annotations.
Baidu-UTS Submission to the EPIC-Kitchens Action Recognition Challenge 2019
cs.CV 2019-06 unverdicted novelty 4.0

Baidu-UTS won the EPIC-Kitchens challenge by guiding 3D CNN training with object detection features via a Gated Feature Aggregator to improve noun prediction.
Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
cs.CV 2025-08 unverdicted novelty 3.0

A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.