arXiv: 1901.11409 · v1 · pith:6BNL46B7new · submitted 2019-01-29 · cs.CV · cs.LG · stat.ML

Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need

keywords: datasets · data · redundancies · been · better · find · models · semantic

Large datasets have been crucial to the success of deep learning models in recent years, which keep performing better as they are trained with more labelled data. While there have been sustained efforts to make these models more data-efficient, the potential benefit of understanding the data itself is largely untapped. Specifically, focusing on object recognition tasks, we ask whether, for common benchmark datasets, we can do better than random subsets of the data and find a subset such that a model trained on it generalizes on par with one trained on the full dataset. To our knowledge, this is the first result to find notable redundancies (at least 10%) in the CIFAR-10 and ImageNet datasets. Interestingly, we observe semantic correlations between required and redundant images. We hope that our findings can motivate further research into identifying additional redundancies and exploiting them for more efficient training or data collection.
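
The abstract measures itself against random subsets of the data. As a rough illustration of that baseline only (not the paper's selection method, which the abstract does not describe), the sketch below draws a class-balanced random subset of indices. The function name `stratified_random_subset`, the per-class stratification, the default 90% keep fraction, and the toy labels are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_random_subset(labels, keep_fraction=0.9, seed=0):
    """Return sorted indices of a random subset that keeps roughly
    `keep_fraction` of the examples in every class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for label in sorted(by_class):
        indices = by_class[label]
        n_keep = round(len(indices) * keep_fraction)
        # Sample without replacement inside each class.
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


# Toy labels standing in for a 10-class dataset with 100 images per class.
toy_labels = [i % 10 for i in range(1000)]
subset = stratified_random_subset(toy_labels, keep_fraction=0.9)
print(len(subset), "of", len(toy_labels), "examples kept")
```

A model trained on such a subset gives the reference accuracy that any non-random selection would need to beat at the same subset size.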

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Interaction-Aware Influence Functions for Group Attribution

    cs.LG · 2026-05 · conditional · novelty 6.0

    Extends influence functions with a second-order pairwise interaction term that improves group attribution accuracy over simple summation on multiple model-dataset pairs and instruction-tuning selection tasks.

  2. TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models

    eess.IV · 2025-10 · unverdicted · novelty 6.0

    TinyUSFM distills a large ultrasound foundation model into a lightweight version using feature-gradient coreset selection and domain-separated masked image modeling, matching performance on a new 18-dataset benchmark ...

  3. Data Selection for training Semantic Segmentation CNNs with cross-dataset weak supervision

    cs.CV · 2019-07 · unverdicted · novelty 5.0

    Two data selection techniques (GMM visual similarity and bounding-box diversity) reduce required weakly labeled images by up to 100x on Open Images and 20x on Cityscapes while maintaining semantic segmentation performance.