Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need

Hossein Mobahi; Samy Bengio; Vighnesh Birodkar

arXiv: 1901.11409 · v1 · pith:6BNL46B7new · submitted 2019-01-29 · cs.CV · cs.LG · stat.ML

keywords: datasets, data, redundancies, been, better, find, models, semantic

Large datasets have been crucial to the recent success of deep learning models, which keep performing better as they are trained on more labelled data. While there have been sustained efforts to make these models more data-efficient, the potential benefit of understanding the data itself is largely untapped. Specifically, focusing on object recognition tasks, we ask whether, for common benchmark datasets, we can do better than random subsets of the data and find a subset that, when trained on, generalizes on par with the full dataset. To our knowledge, this is the first result that finds notable redundancies in the CIFAR-10 and ImageNet datasets (at least 10%). Interestingly, we observe semantic correlations between required and redundant images. We hope that our findings can motivate further research into identifying additional redundancies and exploiting them for more efficient training or data collection.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Interaction-Aware Influence Functions for Group Attribution
cs.LG 2026-05 conditional novelty 6.0

Extends influence functions with a second-order pairwise interaction term that improves group attribution accuracy over simple summation on multiple model-dataset pairs and instruction-tuning selection tasks.

TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models
eess.IV 2025-10 unverdicted novelty 6.0

TinyUSFM distills a large ultrasound foundation model into a lightweight version using feature-gradient coreset selection and domain-separated masked image modeling, matching performance on a new 18-dataset benchmark ...

Data Selection for training Semantic Segmentation CNNs with cross-dataset weak supervision
cs.CV 2019-07 unverdicted novelty 5.0

Two data selection techniques (GMM visual similarity and bounding-box diversity) reduce required weakly labeled images by up to 100x on Open Images and 20x on Cityscapes while maintaining semantic segmentation performance.