Training on test data: Removing near duplicates in Fashion-MNIST
Pith reviewed 2026-05-25 20:04 UTC · model grok-4.3
The pith
Near-duplicate images between training and test sets inflate Fashion-MNIST accuracies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Near-duplicate images between testing and training sets artificially increase the testing accuracy of machine learning models. This paper identifies near-duplicate images in Fashion MNIST and proposes a dataset with near-duplicates removed.
What carries the argument
Near-duplicate detection across train and test splits to remove data leakage.
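As a rough illustration of what cross-split near-duplicate detection can look like, the sketch below flags test images whose best cosine similarity to any training image exceeds a threshold. The metric, the threshold value, and the function name `find_near_duplicates` are assumptions made for illustration; the paper's actual procedure is not described in the material reviewed here. The data is synthetic, not real Fashion-MNIST.

```python
import numpy as np

def find_near_duplicates(train, test, threshold=0.95):
    """Return (test_idx, train_idx, similarity) for each test image whose best
    cosine similarity to any training image is at least `threshold`.
    Hypothetical helper: metric and threshold are illustrative assumptions."""
    # Flatten to vectors and L2-normalise so a dot product is cosine similarity.
    a = train.reshape(len(train), -1).astype(np.float64)
    b = test.reshape(len(test), -1).astype(np.float64)
    a /= np.linalg.norm(a, axis=1, keepdims=True) + 1e-12
    b /= np.linalg.norm(b, axis=1, keepdims=True) + 1e-12
    sims = b @ a.T                       # shape: (n_test, n_train)
    best = sims.argmax(axis=1)           # closest training image per test image
    best_sim = sims[np.arange(len(test)), best]
    flagged = np.flatnonzero(best_sim >= threshold)
    return [(int(i), int(best[i]), float(best_sim[i])) for i in flagged]

# Synthetic stand-in for 28x28 grayscale images (not real Fashion-MNIST data).
rng = np.random.default_rng(0)
train = rng.random((50, 28, 28))
test = rng.random((10, 28, 28))
# Plant one near-duplicate: test image 3 is a slightly noisy copy of train image 7.
test[3] = np.clip(train[7] + rng.normal(0, 0.01, (28, 28)), 0, 1)
pairs = find_near_duplicates(train, test, threshold=0.99)
```

On the full dataset the dense similarity matrix would be large (10,000 test by 60,000 training images), so a practical version would likely process the test set in chunks.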
If this is right
- Test accuracies on the cleaned dataset will be lower than on the original for the same models.
- Published results using Fashion-MNIST may require adjustment for the effect of duplicates.
- The cleaned dataset provides a more reliable measure of generalization.
Where Pith is reading between the lines
- The same near-duplicate issue could affect other image classification benchmarks.
- Deduplication steps might be added to standard dataset preparation pipelines.
Load-bearing premise
The procedure used to label images as near-duplicates correctly identifies cases that produce data leakage in typical model training pipelines.
What would settle it
Compare the test accuracy of a model trained and evaluated on the original Fashion-MNIST versus the version with near-duplicates removed; a significant drop would support the claim.
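One simple way to frame that comparison is sketched below, assuming a single trained model whose test-set predictions are already available, plus a boolean mask marking which test images were flagged as near-duplicates. The function name, the mask, and the toy labels are hypothetical. This version removes duplicates only from the test side and does not retrain; retraining on a deduplicated training set would be a separate experiment.

```python
import numpy as np

def accuracy_before_after(y_true, y_pred, is_duplicate):
    """Accuracy on the full test set, on the test set with flagged images
    removed, and on the flagged near-duplicates alone."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    dup = np.asarray(is_duplicate, dtype=bool)
    correct = y_true == y_pred
    return {
        "original": float(correct.mean()),
        "cleaned": float(correct[~dup].mean()),
        "duplicates_only": float(correct[dup].mean()),
    }

# Toy labels and predictions, purely illustrative (not results from the paper).
y_true = [0, 1, 2, 3, 4, 5, 6, 7]
y_pred = [0, 1, 2, 3, 0, 5, 0, 0]
is_dup = [True, True, False, False, False, False, False, False]
result = accuracy_before_after(y_true, y_pred, is_dup)
```

If accuracy on the flagged subset is much higher than on the remainder, that is consistent with leakage, though it could also mean the flagged images are simply easier examples.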
Original abstract
MNIST and Fashion MNIST are extremely popular for testing in the machine learning space. Fashion MNIST improves on MNIST by introducing a harder problem, increasing the diversity of testing sets, and more accurately representing a modern computer vision task. In order to increase the data quality of FashionMNIST, this paper investigates near duplicate images between training and testing sets. Near-duplicates between testing and training sets artificially increase the testing accuracy of machine learning models. This paper identifies near-duplicate images in Fashion MNIST and proposes a dataset with near-duplicates removed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies near-duplicate images between the training and test splits of Fashion-MNIST, asserts that these duplicates artificially inflate reported test accuracies of ML models, and proposes a cleaned version of the dataset with the duplicates removed.
Significance. If the detection procedure is reliable and the accuracy-inflation effect is empirically confirmed, the work would supply a higher-quality public benchmark and underscore data-leakage risks in standard vision datasets; the provision of the cleaned dataset itself would be a concrete, reusable contribution.
Major comments (2)
- [Abstract] The claim that near-duplicates 'artificially increase the testing accuracy' is asserted without any reported measurement of model accuracy on the identified duplicate subset versus the non-duplicate test images, or any before/after removal experiment; this differential-performance evidence is required to substantiate the central causal claim.
- The similarity metric, decision threshold, and any validation of the near-duplicate labeling procedure are not described, leaving the weakest assumption (that the labeled pairs are precisely the ones that produce leakage under typical training pipelines) unexamined and load-bearing for all downstream claims.
Minor comments (1)
- The number and distribution of identified near-duplicates should be reported quantitatively (e.g., counts per class or similarity-score histogram) to allow readers to gauge the scale of the issue.
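The kind of quantitative summary requested here could be produced along the following lines. This is a minimal sketch that assumes each flagged test image comes with a class label and a similarity score in [0.9, 1.0]; the function name, bin range, and toy values are illustrative assumptions, not figures from the paper.

```python
import numpy as np

def summarize_duplicates(dup_labels, dup_sims, n_classes=10, bins=5, lo=0.9, hi=1.0):
    """Per-class counts of flagged images and a histogram of their similarity scores."""
    counts = np.bincount(np.asarray(dup_labels), minlength=n_classes)
    hist, edges = np.histogram(dup_sims, bins=bins, range=(lo, hi))
    return counts, hist, edges

# Toy inputs: class labels (0-9, as in Fashion-MNIST) and made-up similarity scores.
labels = [0, 0, 6, 6, 6, 9]
sims = [0.91, 0.99, 0.95, 0.97, 0.999, 0.93]
counts, hist, edges = summarize_duplicates(labels, sims)
```

Reporting `counts` per class and `hist` per similarity bin would let readers see whether the flagged images cluster in a few classes or sit just above the chosen threshold.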
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify areas where the manuscript's claims require stronger empirical support and methodological transparency. We address each point below and will revise the manuscript accordingly.
- Referee: [Abstract] The claim that near-duplicates 'artificially increase the testing accuracy' is asserted without any reported measurement of model accuracy on the identified duplicate subset versus the non-duplicate test images, or any before/after removal experiment; this differential-performance evidence is required to substantiate the central causal claim.
  Authors: We agree that the manuscript asserts the accuracy-inflation effect without providing the requested differential measurements or before/after experiments. The original work focuses on detection and release of the cleaned dataset but does not include model training results on the duplicate subset versus the remainder or direct comparisons of test accuracy before and after removal. We will add these experiments in revision, reporting accuracies for representative models on the identified duplicate images, the non-duplicate test images, and overall before/after removal.
  Revision: yes
- Referee: The similarity metric, decision threshold, and any validation of the near-duplicate labeling procedure are not described, leaving the weakest assumption (that the labeled pairs are precisely the ones that produce leakage under typical training pipelines) unexamined and load-bearing for all downstream claims.
  Authors: The referee is correct that the manuscript does not adequately describe the similarity metric, threshold, or validation of the labeling. We will expand the methods section to specify the exact metric and threshold used, and add validation details (for example, sample visualizations or quantitative checks) to support that the detected pairs are the ones likely to produce leakage.
  Revision: yes
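One possible form of the quantitative check mentioned in this response is a planted-duplicate test: create known near-duplicates by perturbing training images, then measure how many a given threshold recovers and how many unrelated images it wrongly flags. The sketch below assumes cosine similarity as the metric and Gaussian pixel noise as the perturbation. Both are illustrative choices, and the helper names are hypothetical.

```python
import numpy as np

def max_cosine_to_train(train, test):
    """Best cosine similarity of each test image to any training image."""
    a = train.reshape(len(train), -1).astype(np.float64)
    b = test.reshape(len(test), -1).astype(np.float64)
    a /= np.linalg.norm(a, axis=1, keepdims=True) + 1e-12
    b /= np.linalg.norm(b, axis=1, keepdims=True) + 1e-12
    return (b @ a.T).max(axis=1)

def planted_recall(train, threshold, noise_std=0.02, n_planted=20, seed=0):
    """Recall on noisy copies of training images and false-positive rate on
    unrelated random images, for a given similarity threshold."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train), size=n_planted, replace=False)
    planted = np.clip(train[idx] + rng.normal(0, noise_std, train[idx].shape), 0, 1)
    fresh = rng.random(planted.shape)    # unrelated images: should not be flagged
    recall = float((max_cosine_to_train(train, planted) >= threshold).mean())
    false_pos = float((max_cosine_to_train(train, fresh) >= threshold).mean())
    return recall, false_pos

# Synthetic stand-in for the training split (not real Fashion-MNIST data).
train = np.random.default_rng(1).random((100, 28, 28))
recall, false_pos = planted_recall(train, threshold=0.99)
```

Uniform random images are far easier to separate than real clothing images, so on actual data the interesting output is how recall and false positives trade off as `threshold` and `noise_std` vary.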
Circularity Check
Empirical dataset scan with no derivation chain or self-referential steps
The paper is an empirical investigation that scans the public Fashion-MNIST dataset for near-duplicates using a similarity procedure and releases a cleaned version. No equations, parameters, or predictions are derived; the abstract and title simply state the motivation and outcome of the scan. No self-citations, ansatzes, or uniqueness theorems appear in the provided text. The work is self-contained against the external public dataset and does not reduce any claimed result to its own inputs by construction.