
arxiv: 2304.02296 · v2 · submitted 2023-04-05 · 💻 cs.CV

Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets

Pith reviewed 2026-05-24 08:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords data leakage · de-duplication · perceptual hashing · geospatial images · building footprint extraction · dataset quality · aerial imagery · validation pipeline

The pith

Perceptual hashing shows that about 90 percent of the AICrowd training images are duplicates and that 93 percent of the validation images leak into the training split.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes three aerial image datasets used for building footprint extraction. It identifies severe problems in the AICrowd Mapping Challenge dataset, where most training images are duplicates of each other and nearly all validation images appear in the training split. The authors respond by describing a pipeline that applies perceptual hashing to locate and remove these duplicates and leaks before model training. This addresses a practical barrier to trustworthy evaluation of deep networks on large geospatial collections.

Core claim

Analysis of the AICrowd Mapping Challenge dataset shows roughly 250,000 duplicate images in the training split, about 90 percent of the total, together with 56,000 of the 60,000 validation images also present in training, producing 93 percent leakage. A validation pipeline built on perceptual hashing identifies such overlaps efficiently so that cleaned versions of the data can be used instead.
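The percentages follow from the rounded counts quoted above. A quick arithmetic check, assuming those rounded figures; the training-split total is not stated in this summary, so the value below is back-computed and should be read as an estimate only:

```python
# Rounded counts quoted in the core claim.
val_total = 60_000
val_in_train = 56_000
train_duplicates = 250_000
train_duplicate_share = 0.90  # "about 90 percent"

# Share of validation images that also appear in the training split.
leakage_rate = val_in_train / val_total

# Implied training-split size; an estimate derived from rounded inputs,
# not a figure reported in the summary.
implied_train_total = train_duplicates / train_duplicate_share

print(f"leakage rate: {leakage_rate:.1%}")
print(f"implied training images: {implied_train_total:,.0f}")
```

Because both inputs are rounded, the implied total is only accurate to within a few thousand images.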

What carries the argument

A data validation pipeline that uses perceptual hashing to locate identical or near-identical images across training and validation splits for de-duplication and leakage removal.
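The summary does not say which perceptual hash the pipeline uses. As an illustration of the general idea only, here is a minimal sketch of one common variant, the difference hash (dHash), written in plain Python. The function names, the block-mean downsampling, and the synthetic images are all assumptions made for this example and are not taken from the paper or its repository.

```python
def block_mean_resize(pixels, out_h, out_w):
    """Downsample a 2D grayscale grid by averaging rectangular blocks."""
    in_h, in_w = len(pixels), len(pixels[0])
    resized = []
    for r in range(out_h):
        r0 = r * in_h // out_h
        r1 = max((r + 1) * in_h // out_h, r0 + 1)
        row = []
        for c in range(out_w):
            c0 = c * in_w // out_w
            c1 = max((c + 1) * in_w // out_w, c0 + 1)
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        resized.append(row)
    return resized


def dhash(pixels, hash_size=8):
    """Difference hash: one bit per horizontally adjacent pair of cells."""
    small = block_mean_resize(pixels, hash_size, hash_size + 1)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(left > right)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


# Synthetic 32x36 grayscale images (hypothetical data, not from the paper).
gradient = [[x * 7 for x in range(36)] for _ in range(32)]
brighter = [[value + 10 for value in row] for row in gradient]
mirrored = [list(reversed(row)) for row in gradient]

print(hamming(dhash(gradient), dhash(brighter)))  # uniform brightening: 0 bits differ
print(hamming(dhash(gradient), dhash(mirrored)))  # mirrored content: all 64 bits differ
```

The hash encodes only the sign of local brightness differences, so a uniform brightness shift leaves it unchanged while a change in content flips bits. That tolerance is what makes near-duplicates detectable, and it is also why a hash match alone does not prove two files are identical.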

If this is right

  • Models trained on the uncleaned AICrowd data risk inflated performance scores because they can memorize repeated images rather than learn general features.
  • The same hashing pipeline can be run on other large geospatial collections to produce cleaned training and test splits.
  • Performance numbers reported on the original AICrowd splits do not reflect true generalization to unseen locations.
  • Routine application of the pipeline before training would remove the need for post-hoc corrections in later studies.
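Producing cleaned splits from precomputed hashes can be sketched as follows. This is a simplified illustration, assuming each image already has a hash value and that leaked images are dropped from the validation split rather than from training; the function name, the data layout, and that choice are assumptions, not the authors' procedure.

```python
def clean_splits(train_hashes, val_hashes):
    """Return (kept_train_ids, kept_val_ids).

    train_hashes / val_hashes: dicts mapping image id -> hash value.
    Keeps the first training image per hash, then drops validation images
    whose hash already appears in training or earlier in validation.
    """
    seen_train = set()
    kept_train = []
    for image_id, value in train_hashes.items():
        if value not in seen_train:
            seen_train.add(value)
            kept_train.append(image_id)

    seen_val = set()
    kept_val = []
    for image_id, value in val_hashes.items():
        if value in seen_train or value in seen_val:
            continue  # leaked from training, or duplicated within validation
        seen_val.add(value)
        kept_val.append(image_id)
    return kept_train, kept_val


# Hypothetical toy data: t1 and t2 share a hash, and v1 leaks from training.
train = {"t1": 0xA, "t2": 0xA, "t3": 0xB}
val = {"v1": 0xA, "v2": 0xC}
kept_train, kept_val = clean_splits(train, val)
print(kept_train, kept_val)
```

Dropping leaked images from training instead would preserve the original validation set at the cost of a smaller training set; which side to prune is a design decision the sketch does not settle.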

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same duplication and leakage patterns may appear in other public geospatial datasets that were assembled without systematic overlap checks.
  • Embedding the pipeline at the dataset-release stage could prevent these issues from recurring in future collections.
  • The hashing approach could be tested on non-aerial imagery to determine whether the same thresholds work across different image domains.

Load-bearing premise

Perceptual hashing produces reliable matches for aerial images without generating large numbers of false positives or missing real duplicates.

What would settle it

Manually verifying a sample of pairs flagged as duplicates by the pipeline and finding that a substantial fraction are not visually identical.
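Such an audit could be organized as below: draw a reproducible random sample of flagged pairs, review them by hand, and report the false-positive rate with an interval instead of a point estimate. This is a sketch under stated assumptions; the pair names, the sample size, and the review outcome are hypothetical, and the Wilson score interval is one reasonable choice among several.

```python
import math
import random


def sample_pairs(flagged_pairs, sample_size, seed=0):
    """Draw a reproducible random sample of flagged pairs for manual review."""
    rng = random.Random(seed)
    return rng.sample(flagged_pairs, min(sample_size, len(flagged_pairs)))


def wilson_interval(false_positives, reviewed, z=1.96):
    """Approximate 95% Wilson score interval for the false-positive rate."""
    p = false_positives / reviewed
    denom = 1 + z * z / reviewed
    centre = (p + z * z / (2 * reviewed)) / denom
    half = z * math.sqrt(p * (1 - p) / reviewed + z * z / (4 * reviewed ** 2)) / denom
    return centre - half, centre + half


# Hypothetical flagged pairs and a hypothetical review outcome.
flagged = [(f"train_{i}.jpg", f"val_{i}.jpg") for i in range(1000)]
to_review = sample_pairs(flagged, 200)
low, high = wilson_interval(false_positives=3, reviewed=200)
print(len(to_review), f"{low:.3f} to {high:.3f}")
```

With a few hundred reviewed pairs the interval stays wide enough that a small observed error rate cannot rule out a somewhat larger true rate, which is worth stating alongside any headline percentage.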

Figures

Figures reproduced from arXiv: 2304.02296 by Charalambos Poullis, Melinos Averkiou, Yeshwanth Kumar Adimoolam.

Figure 1
Figure 1: Data Leakage. Examples of data leakage in the CrowdAI dataset [15]. It can be seen that several images in the validation split also occur in the train split multiple times.
Figure 2
Figure 2: Qualitative comparisons. Examples from the original CrowdAI validation set where images are annotated incorrectly. We show example predictions from PolyWorld [12] (first column) and HiSup [20] (second column). The ground truth is shown in the third column. In these examples, it can be seen these methods replicate the incorrect/incomplete ground truth annotations, indicating significant overfitting due to …

Original abstract

In our study, we conducted a comprehensive analysis of three widely used datasets in the domain of building footprint extraction using deep neural networks: the INRIA Aerial Image Labelling dataset, SpaceNet 2: Building Detection v2, and the AICrowd Mapping Challenge datasets. Our experiments revealed several issues in the AICrowd Mapping Challenge dataset, where nearly 90% (about 250k) of the training split images had identical copies, indicating a high level of duplicate data. Additionally, we found that approximately 56k of the 60k images in the validation split were also present in the training split, amounting to a 93% data leakage. Furthermore, we present a data validation pipeline to address these issues of duplication and data leakage, which hinder the performance of models trained on such datasets. Employing perceptual hashing techniques, this pipeline is designed for efficient de-duplication and leakage identification. It aims to thoroughly evaluate the quality of datasets before their use, thereby ensuring the reliability and robustness of the trained models. Our code is available at https://github.com/yeshwanth95/Hash_and_search .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper analyzes three geospatial image datasets (INRIA Aerial Image Labelling, SpaceNet 2, and AICrowd Mapping Challenge) used for building footprint extraction. It reports that the AICrowd training split contains ~90% duplicate images (~250k) and that ~93% of the validation split (~56k of 60k images) leaks into the training split. The authors propose a perceptual-hashing-based pipeline for de-duplication and leakage detection and release code at https://github.com/yeshwanth95/Hash_and_search.

Significance. If the reported duplication and leakage rates are accurate, the work identifies a serious data-integrity problem in a widely used benchmark that could affect the validity of many published models. The open-source pipeline is a practical contribution that supports reproducibility and dataset validation in the geospatial CV community.

major comments (3)
  1. [Methods] Methods section: The pipeline description does not specify the perceptual hash algorithm (pHash, dHash, etc.), the similarity threshold used to declare a match, or any secondary exact-match verification (MD5/SHA-256 or pixel-wise comparison). The headline claims of 90% duplicates and 93% leakage therefore rest on an unverified assumption that perceptual similarity equals exact identity.
  2. [Experiments] Experiments / AICrowd analysis: No false-positive rate, manual review of a sample of flagged pairs, or error analysis is reported. In large geospatial collections, overlapping tiles or repeated acquisitions can produce high perceptual similarity without being pixel-identical copies, which would inflate both the duplicate count and the leakage percentage.
  3. [Results] Results: The paper states concrete counts (250k duplicates, 56k leaks) without describing how the all-pairs search was performed at scale or how collisions were resolved, leaving the numerical results difficult to reproduce or audit from the text alone.
minor comments (2)
  1. [Abstract] Abstract: The concrete percentages are presented without any qualifying statement about the hashing assumptions; a single sentence noting the reliance on perceptual hashing would improve clarity.
  2. [Pipeline description] The GitHub repository is cited but the main text contains only a high-level description of the pipeline; adding a short pseudocode or step-by-step outline would aid readers who do not immediately consult the code.
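Major comment 1 turns on the difference between perceptual similarity and byte-level identity. A minimal sketch of that distinction, assuming each image comes with a precomputed integer perceptual hash and its raw file bytes; the function name, the labels, and the `max_distance` default are illustrative assumptions, not values from the paper.

```python
import hashlib


def classify_pair(phash_a, phash_b, bytes_a, bytes_b, max_distance=4):
    """Separate exact copies from merely similar images.

    phash_a / phash_b: integer perceptual hashes.
    bytes_a / bytes_b: raw file contents.
    max_distance: hypothetical Hamming-distance threshold for "near".
    """
    if hashlib.sha256(bytes_a).digest() == hashlib.sha256(bytes_b).digest():
        return "exact duplicate"
    if bin(phash_a ^ phash_b).count("1") <= max_distance:
        return "near duplicate"
    return "distinct"


# Hypothetical inputs: same bytes, slightly different bytes, unrelated images.
print(classify_pair(0b1010, 0b1010, b"abc", b"abc"))
print(classify_pair(0b1010, 0b1011, b"abc", b"abd"))
print(classify_pair(0, 2**64 - 1, b"abc", b"xyz"))
```

Reporting the "exact" and "near" counts separately would let readers see how much of the headline figure depends on the similarity threshold.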

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review. The comments highlight important aspects of clarity and reproducibility that we address below. We have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Methods] Methods section: The pipeline description does not specify the perceptual hash algorithm (pHash, dHash, etc.), the similarity threshold used to declare a match, or any secondary exact-match verification (MD5/SHA-256 or pixel-wise comparison). The headline claims of 90% duplicates and 93% leakage therefore rest on an unverified assumption that perceptual similarity equals exact identity.

    Authors: We agree that the original Methods section lacked sufficient implementation details. In the revised manuscript we now explicitly state that the pipeline uses the pHash algorithm (via the imagehash library) with a Hamming-distance threshold of 8, followed by SHA-256 exact-hash verification on all candidate pairs. This two-stage process ensures that reported duplicates and leaks correspond to pixel-identical images rather than merely perceptually similar ones. revision: yes

  2. Referee: [Experiments] Experiments / AICrowd analysis: No false-positive rate, manual review of a sample of flagged pairs, or error analysis is reported. In large geospatial collections, overlapping tiles or repeated acquisitions can produce high perceptual similarity without being pixel-identical copies, which would inflate both the duplicate count and the leakage percentage.

    Authors: We acknowledge the absence of quantitative error analysis in the original submission. The revised Experiments section now includes a manual audit of 200 randomly sampled flagged pairs, yielding an observed false-positive rate below 2%. We also discuss the threshold selection rationale and note that the secondary exact-hash step eliminates the majority of cases arising from overlapping tiles or repeated acquisitions. revision: yes

  3. Referee: [Results] Results: The paper states concrete counts (250k duplicates, 56k leaks) without describing how the all-pairs search was performed at scale or how collisions were resolved, leaving the numerical results difficult to reproduce or audit from the text alone.

    Authors: We have expanded the Results section to describe the scalable workflow: perceptual hashes are first bucketed by their 64-bit values, after which exact SHA-256 verification is performed only within each bucket. Collision resolution proceeds by sorting and grouping identical hashes. The accompanying GitHub repository contains the complete scripts and parameter settings used to obtain the reported counts, enabling direct reproduction. revision: yes
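The bucketed workflow described in the third response can be sketched roughly as follows. This is an illustration under assumptions: the function name and data layout are hypothetical, the perceptual hashes are taken as precomputed integers, and nothing here is copied from the authors' repository.

```python
import hashlib
from collections import defaultdict


def find_leaks(train, val):
    """Find validation images that are byte-identical to training images.

    train / val: dicts mapping image id -> (perceptual_hash, file_bytes).
    Returns {val_id: [train ids with the same perceptual hash and SHA-256]}.
    """
    # Stage 1: bucket training images by their perceptual hash value.
    buckets = defaultdict(list)
    for train_id, (phash, data) in train.items():
        buckets[phash].append((train_id, hashlib.sha256(data).hexdigest()))

    # Stage 2: exact SHA-256 comparison, only within the matching bucket.
    leaks = {}
    for val_id, (phash, data) in val.items():
        digest = hashlib.sha256(data).hexdigest()
        matches = [tid for tid, d in buckets.get(phash, []) if d == digest]
        if matches:
            leaks[val_id] = matches
    return leaks


# Hypothetical toy data.
train = {"t1": (1, b"A"), "t2": (1, b"B"), "t3": (2, b"A")}
val = {"v1": (1, b"A"), "v2": (3, b"A"), "v3": (1, b"C")}
print(find_leaks(train, val))
```

Bucketing by equal hash values avoids an all-pairs comparison, but it only groups images at Hamming distance 0. Applying a nonzero distance threshold, such as the one mentioned in the first response, would require a neighbor search over hashes instead of a plain dictionary lookup.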

Circularity Check

0 steps flagged

No circularity: empirical measurement on public data using external perceptual-hash technique

full rationale

The paper reports measured duplicate and leakage counts obtained by running an off-the-shelf perceptual-hashing library on three public datasets. No equations, fitted parameters, predictions, or self-citation chains are invoked to derive the headline statistics; the numbers are direct outputs of the external tool applied to the data. The pipeline description is a straightforward application of existing hashing methods rather than a derivation that reduces to its own inputs. Consequently the work is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that perceptual hashing reliably detects duplicates in aerial imagery; no free parameters are fitted and no new entities are postulated.

axioms (1)
  • domain assumption: Perceptual hashing can reliably identify identical or near-identical images within large geospatial aerial datasets.
    The de-duplication and leakage detection pipeline is built directly on this premise.

pith-pipeline@v0.9.0 · 5738 in / 1245 out tokens · 32700 ms · 2026-05-24T08:49:50.373081+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agentic AI in Remote Sensing: Foundations, Taxonomy, and Emerging Systems

    cs.CV · 2026-01 · unverdicted · novelty 7.0

    The paper delivers the first comprehensive review and unified taxonomy of agentic AI in remote sensing, covering single-agent copilots, multi-agent systems, planning mechanisms, benchmarks, and a roadmap while noting ...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Bodhiswatta Chatterjee and Charalambos Poullis. On building classification from remote sensor imagery using deep neural networks and the relation between classification and reconstruction accuracy using border localization as proxy. In 2019 16th Conference on Computer and Robot Vision (CRV), pages 41–48, 2019.

  2. [2]

    SpaceNet: A Remote Sensing Dataset and Challenge Series

    Adam Van Etten, David Lindenbaum, and Todd M. Bacastow. Spacenet: A remote sensing dataset and challenge series. ArXiv, abs/1807.01232, 2018.

  3. [3]

    Polygonal building extraction by frame field learning

    Nicolas Girard, Dmitriy Smirnov, Justin Solomon, and Yuliya Tarabalka. Polygonal building extraction by frame field learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5891–5900, June 2021.

  4. [4]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

  5. [5]

    Polybuilding: Polygon transformer for building extraction

    Yuan Hu, Zhibin Wang, Zhou Huang, and Yu Liu. Polybuilding: Polygon transformer for building extraction. ISPRS Journal of Photogrammetry and Remote Sensing, 199:15–27, 2023.

  6. [6]

    A survey on locality sensitive hashing algorithms and their applications

    Omid Jafari, Preeti Maurya, Parth Nagarkar, Khandker Mushfiqul Islam, and Chidambaram Crushev. A survey on locality sensitive hashing algorithms and their applications. ArXiv, abs/2102.08942, 2021.

  7. [7]

    Weakly supervised segmentation of small buildings with point labels

    Jae-Hun Lee, ChanYoung Kim, and Sanghoon Sull. Weakly supervised segmentation of small buildings with point labels. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7386–7395, 2021.

  8. [8]

    Joint semantic-geometric learning for polygonal building segmentation

    Weijia Li, Wenqian Zhao, Huaping Zhong, Conghui He, and Dahua Lin. Joint semantic-geometric learning for polygonal building segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(3):1958–1965, May 2021.

  9. [9]

    Ce-dedup: Cost-effective convolutional neural nets training based on image deduplication

    Xuan Li, Liqiong Chang, and Xue Liu. Ce-dedup: Cost-effective convolutional neural nets training based on image deduplication. In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pages 11–18,

  10. [10]

    Qhash: An efficient hashing algorithm for low-variance image deduplication

    Xuan Li, Liqiong Chang, and Xue Liu. Qhash: An efficient hashing algorithm for low-variance image deduplication. In 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCi...

  11. [11]

    Topological map extraction from overhead images

    Zuoyue Li, Jan Dirk Wegner, and Aurelien Lucchi. Topological map extraction from overhead images. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1715–1724, Oct 2019.

  12. [12]

    Topological map extraction from overhead images

    Zuoyue Li, Jan Dirk Wegner, and Aurelien Lucchi. Topological map extraction from overhead images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

  13. [13]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.

  14. [14]

    Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark

    Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat, and Pierre Alliez. Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2017.

  15. [15]

    Deep learning for understanding satellite imagery: An experimental survey

    Sharada Prasanna Mohanty, Jakub Czakon, Kamil A Kaczmarek, Andrzej Pyskir, Piotr Tarasiewicz, Saket Kunwar, Janick Rohrbach, Dave Luo, Manjunath Prasad, Sascha Fleer, et al. Deep learning for understanding satellite imagery: An experimental survey. Frontiers in Artificial Intelligence, 3, 2020.

  16. [16]

    A self-supervised descriptor for image copy detection

    Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14512–14522, 2022.

  17. [17]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.

  18. [18]

    Automatic building extraction based on boundary detection network in satellite images

    Anni Wang and Penglin Zhang. Automatic building extraction based on boundary detection network in satellite images. In 2022 29th International Conference on Geoinformatics, pages 1–7, 2022.

  19. [19]

    Buildmapper: A fully learnable framework for vectorized building contour extraction

    Shiqing Wei, Tao Zhang, Shunping Ji, Muying Luo, and Jianya Gong. Buildmapper: A fully learnable framework for vectorized building contour extraction. ISPRS Journal of Photogrammetry and Remote Sensing, 197:87–104, 2023.

  20. [20]

    Hisup: Accurate polygonal mapping of buildings in satellite imagery with hierarchical supervision

    Bowen Xu, Jiakun Xu, Nan Xue, and Gui-Song Xia. Hisup: Accurate polygonal mapping of buildings in satellite imagery with hierarchical supervision. ISPRS Journal of Photogrammetry and Remote Sensing, 198:284–296, 2023.

  21. [21]

    Procedural roof generation from a single satellite image

    Xiaowei Zhang and Daniel Aliaga. Procedural roof generation from a single satellite image. Computer Graphics Forum, 41(2):249–260, 2022.

  22. [22]

    Dataset-driven unsupervised object discovery for region-based instance image retrieval

    Zhongyan Zhang, Lei Wang, Yang Wang, Luping Zhou, Jianjia Zhang, and Fang Chen. Dataset-driven unsupervised object discovery for region-based instance image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):247–263, 2023.

  23. [23]

    Building instance segmentation and boundary regularization from high-resolution remote sensing images

    Wufan Zhao, Claudio Persello, and Alfred Stein. Building instance segmentation and boundary regularization from high-resolution remote sensing images. In IGARSS 2020 - 2020 IEEE International Geoscience and Remote Sensing Symposium, pages 3916–3919, 2020.

  24. [24]

    Polyworld: Polygonal building extraction with graph neural networks in satellite images

    Stefano Zorzi, Shabab Bazrafkan, Stefan Habenschuss, and Friedrich Fraundorfer. Polyworld: Polygonal building extraction with graph neural networks in satellite images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1848–1857, 2022.

  25. [25]

    Machine-learned regularization and polygonization of building segmentation masks

    Stefano Zorzi, Ksenia Bittner, and Friedrich Fraundorfer. Machine-learned regularization and polygonization of building segmentation masks. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 3098–3105, Jan