Unifying Dataset Pruning and Distillation for Efficient Large-scale Compression

Lingao Xiao; Songhua Liu; Xinchao Wang; Yang He

arxiv: 2502.06434 · v2 · pith:YRTPF5SPnew · submitted 2025-02-10 · 💻 cs.CV · cs.LG

keywords dataset, distillation, image, pruning, benchmark, images, while, compression

Dataset pruning (DP) and dataset distillation (DD) fundamentally differ in their outputs: DP selects original image subsets, while DD generates synthetic images. Recently, DD's increasing reliance on original images suggests a convergence of the two directions. To investigate this convergence trend, we propose a unified dataset compression (DC) benchmark. This benchmark reveals an interesting trade-off for soft-label-DD: while soft labels provide valuable information, they can make the distillation process less essential, as distilled images may not always outperform random subsets. In addition, the benchmark reveals that, at the current stage, dataset pruning outperforms dataset distillation at small dataset sizes. Given these observations, we explore hard-label-DC as a complementary approach that emphasizes image quality while offering substantial storage efficiency. Our PCA (Prune, Combine, and Augment) is the first framework that does not rely on soft labels but instead focuses on image quality. (1) "P" means selecting easy samples based on dataset pruning metrics, (2) "C" indicates combining these samples effectively, and (3) "A" is to apply constrained image augmentation during training. Our code is available at https://github.com/ArmandXiao/Unifying-Dataset-Pruning-and-Distillation
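
The three PCA steps named in the abstract can be illustrated with a minimal, standard-library Python sketch. The abstract does not state which pruning metric, combination rule, or augmentation constraint is used, so everything concrete here is an assumption made for illustration: a per-sample difficulty score where lower means easier, side-by-side concatenation of same-class images, and a random crop with a bounded minimum scale. The function names `prune_easy`, `combine`, and `constrained_crop` are hypothetical and are not taken from the authors' repository.

```python
import random


def prune_easy(scores, keep):
    """P: return indices of the `keep` easiest samples.

    Assumption: `scores` holds one difficulty value per sample and a
    lower value means an easier sample. The real metric is not given
    in the abstract.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:keep]


def combine(images, group):
    """C: merge every `group` consecutive images into one wider image.

    Assumption: images are equal-sized lists of pixel rows from the
    same class, and "combining" means horizontal concatenation.
    Leftover images that do not fill a full group are dropped.
    """
    combined = []
    for start in range(0, len(images) - group + 1, group):
        chunk = images[start:start + group]
        height = len(chunk[0])
        rows = [sum((img[r] for img in chunk), []) for r in range(height)]
        combined.append(rows)
    return combined


def constrained_crop(image, min_scale, rng):
    """A: random crop whose height and width are at least `min_scale`
    of the original (an assumed form of "constrained" augmentation)."""
    h, w = len(image), len(image[0])
    crop_h = rng.randint(max(1, int(h * min_scale)), h)
    crop_w = rng.randint(max(1, int(w * min_scale)), w)
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


# Toy data: four 2x2 "images" of one class with made-up difficulty scores.
toy_images = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
    [[13, 14], [15, 16]],
]
toy_scores = [0.9, 0.1, 0.5, 0.2]

easy_idx = prune_easy(toy_scores, keep=2)
merged = combine([toy_images[i] for i in easy_idx], group=2)
crop = constrained_crop(merged[0], min_scale=0.5, rng=random.Random(0))
```

Because each combined image keeps a single hard class label, nothing beyond the image and its class index needs to be stored, which is the storage argument the abstract makes for hard-label-DC.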

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Soft Label Pruning and Quantization for Large-Scale Dataset Distillation
cs.CV 2026-04 unverdicted novelty 6.0

LPQLD reduces soft label storage in dataset distillation by 78-500x on ImageNet datasets via pruning with dynamic reuse and quantization with student-teacher alignment, while improving accuracy.