WebVision Database: Visual Learning and Understanding from Web Data

Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, Luc Van Gool

classification cs.CV

keywords database, dataset, visual, images, webvision, data, learning, adaptation

In this paper, we present a study on learning visual recognition models from large-scale noisy web data. We build a new database called WebVision, which contains more than 2.4 million web images crawled from the Internet by using queries generated from the 1,000 semantic concepts of the benchmark ILSVRC 2012 dataset. Meta information accompanying those web images (e.g., title, description, tags) is also crawled. A validation set and a test set containing human-annotated images are also provided to facilitate algorithmic development. Based on our new database, we obtain a few interesting observations: 1) the noisy web images are sufficient for training a good deep CNN model for visual recognition; 2) the model learnt from our WebVision database exhibits comparable or even better generalization ability than the one trained from the ILSVRC 2012 dataset when transferred to new datasets and tasks; 3) a domain adaptation issue (a.k.a. dataset bias) is observed, which means the dataset can be used as the largest benchmark dataset for visual domain adaptation. Our new WebVision database and the relevant studies in this work would benefit the advancement of learning state-of-the-art visual models with minimum supervision based on web data.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment
cs.CV 2026-04 unverdicted novelty 6.0

AlignPrune uses a Dynamic Alignment Score from loss trajectories to identify noisy samples more accurately than per-sample loss, improving pruning accuracy by up to 6.3% on noisy benchmarks.

Learning from Imperfect Text Guidance: Robust Long-Tail Visual Recognition with High-Noise Label
cs.CV 2026-04 unverdicted novelty 5.0

Weak Teacher Supervision uses vision-language model text predictions and label discrepancy checks to mitigate high-noise label-image mismatches in long-tailed visual recognition.

See Through the Noise: Improving Domain Generalization in Gaze Estimation
cs.CV 2026-04 unverdicted novelty 5.0

SeeTN builds a semantic embedding space with prototype transformation and affinity regularization to identify and correct noisy labels, yielding better cross-domain gaze estimation without hurting source accuracy.