Training on test data: Removing near duplicates in Fashion-MNIST

Christopher Geier

arXiv: 1906.08255 · v1 · pith:7DITFT7Rnew · submitted 2019-06-19 · cs.LG · cs.CV · stat.ML

Pith reviewed 2026-05-25 20:04 UTC · model grok-4.3

keywords: Fashion-MNIST, near-duplicates, data leakage, benchmark, machine learning, test accuracy

The pith

Near-duplicate images between training and test sets inflate Fashion-MNIST accuracies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates near-duplicate images shared between the training and testing sets of Fashion-MNIST. These duplicates create data leakage that artificially boosts reported model accuracies. By identifying the duplicates and releasing a version of the dataset with them removed, the work aims to provide a cleaner benchmark. A reader would care because many machine learning papers rely on Fashion-MNIST results to claim progress, and leakage undermines those claims. The focus is on improving data quality rather than proposing new algorithms.

Core claim

Near-duplicate images between testing and training sets artificially increase the testing accuracy of machine learning models. This paper identifies near-duplicate images in Fashion MNIST and proposes a dataset with near-duplicates removed.

What carries the argument

Near-duplicate detection across train and test splits to remove data leakage.
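
As a rough illustration of what cross-split near-duplicate detection can look like, the sketch below finds, for each test image, its closest training image by Euclidean distance on raw pixels. The distance metric, the function name `nearest_train_neighbors`, and the synthetic arrays are assumptions made for illustration; the paper's own procedure is not specified in the text here, and the Figure 2 caption indicates that candidate pairs were labeled by hand.

```python
import numpy as np


def nearest_train_neighbors(train, test):
    """For each test image, return (index, distance) of the closest training image.

    Both inputs are arrays of shape (n, height, width). Distance is plain
    Euclidean distance on flattened pixels, which is an assumption of this
    sketch and not necessarily the metric used in the paper.
    """
    a = test.reshape(len(test), -1).astype(np.float64)
    b = train.reshape(len(train), -1).astype(np.float64)
    # Squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    sq = (a ** 2).sum(axis=1)[:, None] - 2.0 * (a @ b.T) + (b ** 2).sum(axis=1)[None, :]
    sq = np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding
    idx = sq.argmin(axis=1)
    dist = np.sqrt(sq[np.arange(len(a)), idx])
    return idx, dist


# Synthetic stand-in for the real data: 28x28 grayscale images.
rng = np.random.default_rng(0)
train_images = rng.integers(0, 256, size=(50, 28, 28))
test_images = rng.integers(0, 256, size=(5, 28, 28))
# Plant one near-duplicate: test image 0 is train image 7 with one pixel changed.
test_images[0] = train_images[7]
test_images[0, 0, 0] = (test_images[0, 0, 0] + 10) % 256

neighbor_idx, neighbor_dist = nearest_train_neighbors(train_images, test_images)
```

On the full 60,000 by 10,000 train/test split the distance matrix would need to be computed in chunks to fit in memory. Small-distance pairs from a scan like this are only candidates; deciding which ones count as near-duplicates still needs a threshold or human review.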

If this is right

Test accuracies on the cleaned dataset will be lower than on the original for the same models.
Published results using Fashion-MNIST may require adjustment for the effect of duplicates.
The cleaned dataset provides a more reliable measure of generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same near-duplicate issue could affect other image classification benchmarks.
Deduplication steps might be added to standard dataset preparation pipelines.

Load-bearing premise

The procedure used to label images as near-duplicates correctly identifies cases that produce data leakage in typical model training pipelines.

What would settle it

Compare the test accuracy of a model trained and evaluated on the original Fashion-MNIST versus the version with near-duplicates removed; a significant drop would support the claim.
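
One way to set up that comparison from a trained model's predictions is sketched below. It assumes a boolean mask marking which test images were flagged as near-duplicates; the mask, the function name `accuracy_report`, and the toy labels are hypothetical and carry no results from the paper.

```python
import numpy as np


def accuracy_report(y_true, y_pred, is_duplicate):
    """Accuracy on the full test set, the flagged subset, and the cleaned set."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    is_duplicate = np.asarray(is_duplicate, dtype=bool)
    correct = y_true == y_pred

    def mean_or_nan(mask):
        # Guard against an empty subset.
        return float(correct[mask].mean()) if mask.any() else float("nan")

    return {
        "original": float(correct.mean()),
        "duplicates_only": mean_or_nan(is_duplicate),
        "cleaned": mean_or_nan(~is_duplicate),
    }


# Toy example: 8 test images, 3 of them flagged as near-duplicates.
y_true = [0, 1, 2, 3, 4, 5, 6, 7]
y_pred = [0, 1, 2, 3, 4, 0, 0, 7]
is_duplicate = [True, True, True, False, False, False, False, False]

report = accuracy_report(y_true, y_pred, is_duplicate)
```

If the leakage claim holds, `duplicates_only` should exceed `cleaned` for real models. The toy numbers are constructed to show that pattern and are not measurements.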

Figures

Figures reproduced from arXiv: 1906.08255 by Christopher Geier.

**Figure 2.** User interface used to label images as distinct versus similar.

**Figure 3.** Number of images removed from the testing set for each class.

Original abstract

MNIST and Fashion MNIST are extremely popular for testing in the machine learning space. Fashion MNIST improves on MNIST by introducing a harder problem, increasing the diversity of testing sets, and more accurately representing a modern computer vision task. In order to increase the data quality of FashionMNIST, this paper investigates near duplicate images between training and testing sets. Near-duplicates between testing and training sets artificially increase the testing accuracy of machine learning models. This paper identifies near-duplicate images in Fashion MNIST and proposes a dataset with near-duplicates removed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Paper spots near-duplicates in Fashion-MNIST and offers a cleaned version, but never checks whether those pairs actually drive higher test accuracy.

The core observation is that Fashion-MNIST has near-duplicates across its train and test splits, and the author supplies a version with them removed. That is the actual new piece: a concrete, usable cleaned dataset for this one benchmark rather than a general method or theoretical result. The work is straightforward and addresses a known practical problem in image benchmarks, so anyone running experiments on Fashion-MNIST can now use the cleaned split if they want to reduce one source of leakage. Credit for doing the scan and releasing the data instead of just complaining about it.

The limitation is that the paper asserts the duplicates artificially inflate test accuracy but does not test the claim. There is no split of the test set into duplicate versus non-duplicate images, no accuracy numbers on each subset, and no retraining experiment showing a measurable drop after removal. Without that link, the leakage effect remains an assumption rather than a measured result. The similarity procedure itself is not described in enough detail in the abstract to judge its false-positive rate or sensitivity, though the reader notes the full text may expand on it.

This is a narrow, incremental data-cleaning note. It is useful for people who already work with Fashion-MNIST and want higher evaluation hygiene, but it does not reorganize any broader practice or open new questions. I would bring it to a reading group focused on benchmark construction or data quality, but I would not cite it in my own work. It is solid enough to deserve referee time rather than a desk reject, mainly because the cleaned dataset is a tangible output that others can verify and use.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies near-duplicate images between the training and test splits of Fashion-MNIST, asserts that these duplicates artificially inflate reported test accuracies of ML models, and proposes a cleaned version of the dataset with the duplicates removed.

Significance. If the detection procedure is reliable and the accuracy-inflation effect is empirically confirmed, the work would supply a higher-quality public benchmark and underscore data-leakage risks in standard vision datasets; the provision of the cleaned dataset itself would be a concrete, reusable contribution.

major comments (2)

[Abstract] The claim that near-duplicates 'artificially increase the testing accuracy' is asserted without any reported measurement of model accuracy on the identified duplicate subset versus the non-duplicate test images, or any before/after removal experiment; this differential-performance evidence is required to substantiate the central causal claim.
The similarity metric, decision threshold, and any validation of the near-duplicate labeling procedure are not described, leaving the weakest assumption (that the labeled pairs are precisely the ones that produce leakage under typical training pipelines) unexamined and load-bearing for all downstream claims.
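
To make the second comment concrete: if the labeling procedure relied on a distance threshold, it could be validated against a hand-labeled sample roughly as follows. Everything here is an assumption for illustration. The distances, the human verdicts, and the name `precision_recall_at` are invented, and the paper may not use a threshold at all.

```python
import numpy as np


def precision_recall_at(distances, human_says_duplicate, threshold):
    """Precision and recall of the rule `distance <= threshold` against human labels."""
    distances = np.asarray(distances, dtype=float)
    truth = np.asarray(human_says_duplicate, dtype=bool)
    flagged = distances <= threshold
    true_pos = np.count_nonzero(flagged & truth)
    precision = true_pos / flagged.sum() if flagged.any() else float("nan")
    recall = true_pos / truth.sum() if truth.any() else float("nan")
    return float(precision), float(recall)


# Hypothetical sample: nearest-neighbor distances and a human verdict for each pair.
distances = [5.0, 12.0, 30.0, 45.0, 80.0, 200.0]
human_says_duplicate = [True, True, True, False, True, False]

# Sweep a few candidate thresholds to see the precision/recall trade-off.
sweep = {t: precision_recall_at(distances, human_says_duplicate, t) for t in (10, 50, 100)}
```

Reporting a sweep like this alongside the chosen threshold would let readers judge how many flagged pairs are false positives.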

minor comments (1)

The number and distribution of identified near-duplicates should be reported quantitatively (e.g., counts per class or similarity-score histogram) to allow readers to gauge the scale of the issue.
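
Per-class counts of this kind are cheap to produce once the removed test indices are known. The sketch below uses `np.bincount` on toy labels; the class names are the standard Fashion-MNIST labels, while the labels, the removed indices, and the helper name `removed_per_class` are made up for illustration.

```python
import numpy as np

CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def removed_per_class(test_labels, removed_indices, num_classes=10):
    """Count how many removed test images fall in each class."""
    labels = np.asarray(test_labels)[np.asarray(removed_indices, dtype=int)]
    return np.bincount(labels, minlength=num_classes)


# Toy labels for 12 test images and a made-up set of removed indices.
test_labels = [0, 0, 1, 2, 2, 2, 6, 6, 8, 9, 9, 9]
removed_indices = [0, 3, 4, 6, 11]

counts = removed_per_class(test_labels, removed_indices)
# Keep only classes that lost at least one image.
summary = {name: int(n) for name, n in zip(CLASS_NAMES, counts) if n > 0}
```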

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the manuscript's claims require stronger empirical support and methodological transparency. We address each point below and will revise the manuscript accordingly.

Referee: [Abstract] The claim that near-duplicates 'artificially increase the testing accuracy' is asserted without any reported measurement of model accuracy on the identified duplicate subset versus the non-duplicate test images, or any before/after removal experiment; this differential-performance evidence is required to substantiate the central causal claim.

Authors: We agree that the manuscript asserts the accuracy-inflation effect without providing the requested differential measurements or before/after experiments. The original work focuses on detection and release of the cleaned dataset but does not include model training results on the duplicate subset versus the remainder or direct comparisons of test accuracy before and after removal. We will add these experiments in revision, reporting accuracies for representative models on the identified duplicate images, the non-duplicate test images, and overall before/after removal.

Revision: yes

Referee: The similarity metric, decision threshold, and any validation of the near-duplicate labeling procedure are not described, leaving the weakest assumption (that the labeled pairs are precisely the ones that produce leakage under typical training pipelines) unexamined and load-bearing for all downstream claims.

Authors: The referee is correct that the manuscript does not adequately describe the similarity metric, threshold, or validation of the labeling. We will expand the methods section to specify the exact metric and threshold used, and add validation details (for example, sample visualizations or quantitative checks) to support that the detected pairs are the ones likely to produce leakage.

Revision: yes

Circularity Check

0 steps flagged

Empirical dataset scan with no derivation chain or self-referential steps

The paper is an empirical investigation that scans the public Fashion-MNIST dataset for near-duplicates using a similarity procedure and releases a cleaned version. No equations, parameters, or predictions are derived; the abstract and title simply state the motivation and outcome of the scan. No self-citations, ansatzes, or uniqueness theorems appear in the provided text. The work is self-contained against the external public dataset and does not reduce any claimed result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, no explicit free parameters, axioms, or invented entities are stated; the work rests on the unelaborated assumption that a similarity-based duplicate detector can be applied to this dataset.
