
arxiv: 1906.08255 · v1 · pith:7DITFT7Rnew · submitted 2019-06-19 · 💻 cs.LG · cs.CV · stat.ML

Training on test data: Removing near duplicates in Fashion-MNIST

Pith reviewed 2026-05-25 20:04 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · stat.ML
keywords: Fashion-MNIST · near-duplicates · data leakage · benchmark · machine learning · test accuracy

The pith

Near-duplicate images between training and test sets inflate Fashion-MNIST accuracies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates near-duplicate images shared between the training and testing sets of Fashion-MNIST. These duplicates create data leakage that artificially boosts reported model accuracies. By identifying the duplicates and releasing a version of the dataset with them removed, the work aims to provide a cleaner benchmark. A reader would care because many machine learning papers rely on Fashion-MNIST results to claim progress, and leakage undermines those claims. The focus is on improving data quality rather than proposing new algorithms.

Core claim

Near-duplicate images between testing and training sets artificially increase the testing accuracy of machine learning models. This paper identifies near-duplicate images in Fashion MNIST and proposes a dataset with near-duplicates removed.

What carries the argument

Near-duplicate detection across train and test splits to remove data leakage.
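
The abstract does not say how near-duplicates were found, and Figure 2 suggests that at least part of the labeling was manual. As a hedged illustration of what a train/test scan could look like, the sketch below flags each test image whose nearest training image falls within a Euclidean-distance threshold. The metric, the `threshold` value, the function name `flag_near_duplicates`, and the synthetic arrays are all assumptions made for illustration, not the paper's procedure.

```python
import numpy as np

def flag_near_duplicates(train, test, threshold):
    """Boolean mask over test images whose nearest training image is within
    `threshold` in Euclidean pixel distance (an assumed metric)."""
    train_flat = train.reshape(len(train), -1).astype(np.float64)
    test_flat = test.reshape(len(test), -1).astype(np.float64)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = (
        (test_flat ** 2).sum(axis=1)[:, None]
        + (train_flat ** 2).sum(axis=1)[None, :]
        - 2.0 * test_flat @ train_flat.T
    )
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding
    nearest = np.sqrt(d2.min(axis=1))
    return nearest <= threshold

# Synthetic stand-in data (not Fashion-MNIST): random 28x28 images.
rng = np.random.default_rng(0)
train = rng.random((50, 28, 28))
test = rng.random((10, 28, 28))
# Plant two near-duplicates: lightly perturbed copies of training images.
test[0] = train[5] + 0.01 * rng.standard_normal((28, 28))
test[3] = train[17] + 0.01 * rng.standard_normal((28, 28))

mask = flag_near_duplicates(train, test, threshold=1.0)
print(np.flatnonzero(mask))  # indices of flagged test images
```

On the full dataset (60,000 training and 10,000 test images) the dense distance matrix would hold 600 million entries, so a real scan would probably process the test images in batches.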

If this is right

  • Test accuracies on the cleaned dataset will be lower than on the original for the same models.
  • Published results using Fashion-MNIST may require adjustment for the effect of duplicates.
  • The cleaned dataset provides a more reliable measure of generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same near-duplicate issue could affect other image classification benchmarks.
  • Deduplication steps might be added to standard dataset preparation pipelines.

Load-bearing premise

The procedure used to label images as near-duplicates correctly identifies cases that produce data leakage in typical model training pipelines.

What would settle it

Compare the test accuracy of a model trained and evaluated on the original Fashion-MNIST versus the version with near-duplicates removed; a significant drop would support the claim.
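
A minimal sketch of that comparison, assuming a model's test predictions and a boolean near-duplicate mask are already available. The helper name `accuracy_report` and the toy labels below are hypothetical and made up for illustration; they are not results from the paper.

```python
import numpy as np

def accuracy_report(y_true, y_pred, is_duplicate):
    """Accuracy on the full test set, on the test set with flagged
    near-duplicates removed, and on the flagged subset alone."""
    correct = y_true == y_pred
    return {
        "original": correct.mean(),
        "cleaned": correct[~is_duplicate].mean(),
        "duplicates_only": correct[is_duplicate].mean(),
    }

# Toy values: the first four test images are treated as near-duplicates.
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y_pred = np.array([0, 1, 2, 3, 4, 5, 0, 7, 1, 2])
is_duplicate = np.array([True] * 4 + [False] * 6)

report = accuracy_report(y_true, y_pred, is_duplicate)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

A gap between `original` and `cleaned` on real predictions is the drop that would support the claim. If either subset is empty, its mean is undefined and NumPy returns `nan`.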

Figures

Figures reproduced from arXiv: 1906.08255 by Christopher Geier.

Figure 1: Very similar images sample
Figure 2: User interface used to label images as distinct versus similar
Figure 3: Number of images removed from the testing set for each class

Original abstract

MNIST and Fashion MNIST are extremely popular for testing in the machine learning space. Fashion MNIST improves on MNIST by introducing a harder problem, increasing the diversity of testing sets, and more accurately representing a modern computer vision task. In order to increase the data quality of FashionMNIST, this paper investigates near duplicate images between training and testing sets. Near-duplicates between testing and training sets artificially increase the testing accuracy of machine learning models. This paper identifies near-duplicate images in Fashion MNIST and proposes a dataset with near-duplicates removed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies near-duplicate images between the training and test splits of Fashion-MNIST, asserts that these duplicates artificially inflate reported test accuracies of ML models, and proposes a cleaned version of the dataset with the duplicates removed.

Significance. If the detection procedure is reliable and the accuracy-inflation effect is empirically confirmed, the work would supply a higher-quality public benchmark and underscore data-leakage risks in standard vision datasets; the provision of the cleaned dataset itself would be a concrete, reusable contribution.

major comments (2)
  1. [Abstract] The claim that near-duplicates 'artificially increase the testing accuracy' is asserted without any reported measurement of model accuracy on the identified duplicate subset versus the non-duplicate test images, or any before/after removal experiment; this differential-performance evidence is required to substantiate the central causal claim.
  2. The similarity metric, decision threshold, and any validation of the near-duplicate labeling procedure are not described, leaving the weakest assumption (that the labeled pairs are precisely the ones that produce leakage under typical training pipelines) unexamined and load-bearing for all downstream claims.
minor comments (1)
  1. The number and distribution of identified near-duplicates should be reported quantitatively (e.g., counts per class or similarity-score histogram) to allow readers to gauge the scale of the issue.
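
The per-class tally requested in the minor comment could be produced along these lines. This is a sketch under stated assumptions: the `labels` and `is_duplicate` arrays are toy placeholders, and `duplicates_per_class` is a hypothetical helper, not code from the paper. The class names follow the standard Fashion-MNIST label order.

```python
import numpy as np

# Standard Fashion-MNIST label order (indices 0 through 9).
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def duplicates_per_class(labels, is_duplicate, num_classes=10):
    """Count flagged test images per class label."""
    return np.bincount(labels[is_duplicate], minlength=num_classes)

# Toy placeholders: eight test images, four of them flagged.
labels = np.array([0, 0, 1, 6, 6, 6, 9, 9])
is_duplicate = np.array([True, False, False, True, True, False, False, True])

counts = duplicates_per_class(labels, is_duplicate)
for name, n in zip(CLASS_NAMES, counts):
    print(f"{name:<12} {n}")
```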

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the manuscript's claims require stronger empirical support and methodological transparency. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The claim that near-duplicates 'artificially increase the testing accuracy' is asserted without any reported measurement of model accuracy on the identified duplicate subset versus the non-duplicate test images, or any before/after removal experiment; this differential-performance evidence is required to substantiate the central causal claim.

    Authors: We agree that the manuscript asserts the accuracy-inflation effect without providing the requested differential measurements or before/after experiments. The original work focuses on detection and release of the cleaned dataset but does not include model training results on the duplicate subset versus the remainder or direct comparisons of test accuracy before and after removal. We will add these experiments in revision, reporting accuracies for representative models on the identified duplicate images, the non-duplicate test images, and overall before/after removal.

    Revision: yes

  2. Referee: The similarity metric, decision threshold, and any validation of the near-duplicate labeling procedure are not described, leaving the weakest assumption (that the labeled pairs are precisely the ones that produce leakage under typical training pipelines) unexamined and load-bearing for all downstream claims.

    Authors: The referee is correct that the manuscript does not adequately describe the similarity metric, threshold, or validation of the labeling. We will expand the methods section to specify the exact metric and threshold used, and add validation details (for example, sample visualizations or quantitative checks) to support that the detected pairs are the ones likely to produce leakage.

    Revision: yes
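
One possible form for the quantitative checks mentioned in the second response is a threshold sweep scored against manual similar/distinct labels, such as those a labeling interface like the one in Figure 2 might produce. The sketch below is an assumption-laden illustration: the distances, the human labels, and the helper `precision_recall` are invented for the example and do not come from the paper.

```python
import numpy as np

def precision_recall(distances, human_is_duplicate, threshold):
    """Score a distance threshold against manual duplicate labels."""
    predicted = distances <= threshold
    tp = np.sum(predicted & human_is_duplicate)
    fp = np.sum(predicted & ~human_is_duplicate)
    fn = np.sum(~predicted & human_is_duplicate)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return float(precision), float(recall)

# Invented nearest-neighbor distances for six candidate pairs, with
# invented manual labels (True means a human judged the pair similar).
distances = np.array([0.2, 0.4, 0.9, 1.5, 2.0, 3.1])
human = np.array([True, True, False, True, False, False])

results = {t: precision_recall(distances, human, t) for t in (0.5, 1.0, 2.0)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

A low threshold favors precision (few false alarms) while a high one favors recall (few missed duplicates); reporting the sweep would let readers judge how sensitive the cleaned dataset is to that choice.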

Circularity Check

0 steps flagged

Empirical dataset scan with no derivation chain or self-referential steps


The paper is an empirical investigation that scans the public Fashion-MNIST dataset for near-duplicates using a similarity procedure and releases a cleaned version. No equations, parameters, or predictions are derived; the abstract and title simply state the motivation and outcome of the scan. No self-citations, ansatzes, or uniqueness theorems appear in the provided text. The work is self-contained against the external public dataset and does not reduce any claimed result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, no explicit free parameters, axioms, or invented entities are stated; the work rests on the unelaborated assumption that a similarity-based duplicate detector can be applied to this dataset.



Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

  [1] Björn Barz and Joachim Denzler. Do we train on test data? Purging CIFAR of near-duplicates, 2019.

  [2] Yann LeCun and Corinna Cortes. MNIST handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/

  [3] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019. URL https://openai.com/blog/better-language-models

  [4] TensorFlow Hub. Fashion MNIST with Keras and TPUs, 2018. URL https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/fashion_mnist.ipynb

  [5] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

  [6] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks, 2013.
