YFCC100M: The New Data in Multimedia Research

Bart Thomee; Benjamin Elizalde; Damian Borth; David A. Shamma; Douglas Poland; Gerald Friedland; Karl Ni; Li-Jia Li

arXiv: 1503.01817 · v2 · pith:BCMJWOEPnew · submitted 2015-03-05 · cs.MM · cs.CY

Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, Li-Jia Li

classification: cs.MM, cs.CY

keywords: dataset, million, flickr, media, multimedia, research, collection, commons

We present the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M), the largest public multimedia collection that has ever been released. The dataset contains a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all of which carry a Creative Commons license. Each media object in the dataset is represented by several pieces of metadata, e.g. Flickr identifier, owner name, camera, title, tags, geo, media source. The collection provides a comprehensive snapshot of how photos and videos were taken, described, and shared over the years, from the inception of Flickr in 2004 until early 2014. In this article we explain the rationale behind its creation, as well as the implications the dataset has for science, research, engineering, and development. We further present several new challenges in multimedia research that can now be expanded upon with our dataset.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Emerging Properties in Self-Supervised Vision Transformers
cs.CV 2021-04 conditional novelty 8.0

Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.

Tackle CSM in JPEG Steganalysis with Data Adaptation
eess.IV 2026-05 unverdicted novelty 7.0

TADA adapts steganalysis models to unknown JPEG processing pipelines via data emulation from small unlabeled sets, yielding gains in robustness to cover source mismatch over baselines.

Scaling Laws for Autoregressive Generative Modeling
cs.LG 2020-10 accept novelty 7.0

Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.