arxiv: 2606.31983 · v1 · pith:NYXNHGUVnew · submitted 2026-06-30 · 💻 cs.DB

Clean Me If You Can: A Large Collection of Real-World Addresses for Data Cleaning Benchmarking

Pith reviewed 2026-07-01 02:06 UTC · model grok-4.3

classification 💻 cs.DB
keywords: data cleaning · benchmark dataset · postal addresses · real-world data · error correction · ground truth · tabular data

The pith

A large real-world dataset of dirty postal addresses with ground truth shows existing data cleaning methods have significant limitations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a large collection of real-world postal address entries that contain errors, paired with corresponding ground truth corrections. This fills a gap where prior data cleaning research relied mostly on controlled or synthetic data that does not reflect actual usage. Tests on the new dataset demonstrate that current cleaning approaches struggle with the variety and realism of the errors present. The authors also extract guidelines to steer future work toward more robust techniques. The result is a public benchmark that lets researchers evaluate methods under conditions closer to production data.

Core claim

By releasing a large dirty dataset of postal entries paired with accurate ground truth, the authors enable better evaluation of data cleaning approaches, which are shown to have limitations on this realistic data, and provide guidelines for future research.

What carries the argument

The dataset of postal addresses with ground truth, which acts as a benchmark to test and reveal weaknesses in data cleaning methods.

If this is right

  • Existing data cleaning approaches perform poorly when applied to this realistic collection of dirty postal data.
  • Future research should follow the derived guidelines to address the complexities of real-world errors.
  • Benchmarking efforts need to shift toward datasets that capture actual production data distributions rather than controlled test cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same collection process could be repeated for other tabular domains such as financial or medical records to create additional benchmarks.
  • The dataset offers a concrete testbed for developing and comparing machine learning models aimed at address correction.
  • Hybrid systems that combine rule-based and learned cleaning steps may show improved results when measured against this resource.

Load-bearing premise

The collected postal entries are representative of real-world dirty data and the provided ground truth is accurate and unbiased.

What would settle it

Independent verification showing systematic errors in the ground truth labels, or a cleaning method achieving high accuracy across the full dataset without having seen its contents during development.
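
One way to make the first test concrete is a sampling audit: an independent annotator re-checks a random sample of ground-truth cells, and the observed disagreement rate is reported with a confidence interval. The sketch below is illustrative only and is not taken from the paper; the function name `wilson_interval`, the sample size, and the disagreement count are all hypothetical.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical audit: 500 sampled ground-truth cells, 12 judged wrong on re-check.
low, high = wilson_interval(errors=12, n=500)
print(f"estimated label error rate: {12 / 500:.3f} (95% CI {low:.3f} to {high:.3f})")
```

A uniform random sample only bounds the overall label error rate. Showing that errors are systematic would additionally require breaking the audit down, for example by country or by column.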

Figures

Figures reproduced from arXiv: 2606.31983 by Fatemeh Ahmadi, Luca Zecchini, Mohamed Abdelmaksoud, Tilmann Rabl, Tobias Bernhard, Ziawasch Abedjan.

Figure 1: Schema alignment between extracted addresses ...
Figure 2: Country distribution in the full-named dataset.
    4.3 Ethical Concerns: In alignment with prior work that analyzed the Common Crawl archives [36], we refrained from including any explicit personally identifiable information (PII), such as SSNs, emails, phone numbers, and banking information. After extraction, we also used the underlying PII detection library [footnote 14] to identify unintentionally captured PII. No PII…
Figure 3: Percentages of clean and erroneous cells (grouped ...
Figure 4: F1-score of different data cleaning systems for error detection and correction on subsets of ...

Original abstract

There has been extensive research on automating and scaling data cleaning, i.e., the detection and correction of erroneous values in tabular data. Yet, existing approaches often perform well only within controlled environments. One of the major bottlenecks in data cleaning research is the lack of real-world datasets. In this paper, we address this gap by providing a large, dirty dataset with postal entries and their corresponding ground truth. We discuss the design decisions and challenges for obtaining the dataset. We demonstrate the limitations of existing cleaning approaches when faced with our proposed datasets and derive guidelines for future research.
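
The abstract's definition of data cleaning as detection and correction of erroneous cell values suggests a cell-level evaluation against the ground truth. The following minimal sketch assumes one common convention: a cell counts as detected if the system changes it, and as corrected if the new value equals the ground truth. The paper's exact metric definitions are not given on this page, and the toy rows and the names `prf` and `score_cleaning` are hypothetical.

```python
from typing import Dict, List, Tuple

Table = List[Dict[str, str]]


def prf(hits: int, predicted: int, actual: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = hits / predicted if predicted else 0.0
    recall = hits / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score_cleaning(dirty: Table, clean: Table, repaired: Table) -> Dict[str, Tuple[float, float, float]]:
    """Cell-level scores; assumes the three tables are row-aligned with equal columns."""
    actual_errors = changed = detected = corrected = 0
    for d_row, c_row, r_row in zip(dirty, clean, repaired):
        for col in c_row:
            is_error = d_row[col] != c_row[col]      # dirty cell differs from ground truth
            was_changed = r_row[col] != d_row[col]   # system touched this cell
            actual_errors += is_error
            changed += was_changed
            if is_error and was_changed:
                detected += 1
                if r_row[col] == c_row[col]:
                    corrected += 1
    return {
        "detection": prf(detected, changed, actual_errors),
        "correction": prf(corrected, changed, actual_errors),
    }


# Toy data (invented for illustration, not from the dataset).
dirty = [{"street": "Main Stret 5", "city": "Berlin", "zip": "10115"},
         {"street": "Hauptstr. 1", "city": "Potsdm", "zip": "14467"}]
clean = [{"street": "Main Street 5", "city": "Berlin", "zip": "10115"},
         {"street": "Hauptstr. 1", "city": "Potsdam", "zip": "14467"}]
repaired = [{"street": "Main Street 5", "city": "Berlin", "zip": "10115"},
            {"street": "Hauptstr. 1", "city": "Potsdam City", "zip": "14468"}]

scores = score_cleaning(dirty, clean, repaired)
print(scores)
```

In the toy data the system finds both real errors but also alters a clean ZIP code and repairs only one error correctly, so correction F1 comes out lower than detection F1.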

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a large collection of real-world dirty postal address entries paired with corresponding ground truth. It discusses design decisions and challenges encountered during dataset construction, evaluates the performance of existing data cleaning methods on the new data to illustrate their limitations, and derives guidelines for future data cleaning research.

Significance. A well-documented, large-scale benchmark with verified ground truth and representative error patterns would address a recognized gap in data cleaning research, where most evaluations rely on synthetic or small-scale data. If the collection and verification processes are shown to be independent and unbiased, the resource could enable more realistic method comparisons and the guidelines could usefully inform subsequent work. The contribution is primarily the dataset release itself rather than new algorithms or proofs.

major comments (2)
  1. [Abstract] Abstract and design decisions discussion: the central claim that the dataset is 'large' with 'corresponding ground truth' and that existing methods fail on it is unsupported by any quantitative details on collection methodology, total number of entries, error-type distribution, or the independent verification process used to establish ground truth. Without these, the benchmark value and the demonstration of method limitations cannot be assessed.
  2. [Design decisions section] The section discussing design decisions and challenges does not address how ground truth was obtained in a manner independent of the cleaning methods being benchmarked, nor does it report any cross-validation or external reference checks. This directly affects the weakest assumption that the ground truth is accurate and unbiased for postal addresses.
minor comments (1)
  1. [Abstract] The abstract could explicitly state the scale (number of addresses, number of methods evaluated) rather than using only qualitative descriptors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify opportunities to strengthen the presentation of quantitative details and the ground truth verification process. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract and design decisions discussion: the central claim that the dataset is 'large' with 'corresponding ground truth' and that existing methods fail on it is unsupported by any quantitative details on collection methodology, total number of entries, error-type distribution, or the independent verification process used to establish ground truth. Without these, the benchmark value and the demonstration of method limitations cannot be assessed.

    Authors: We agree that the abstract and design decisions discussion would benefit from explicit quantitative support for the claims. Although the manuscript body contains these details in the dataset construction and evaluation sections, we will revise the abstract to include a concise summary of collection methodology, total entries, error-type distribution, and verification approach. We will also add a short quantitative overview paragraph to the design decisions section. This will make the benchmark value and method limitations clearer without altering the core contribution. revision: yes

  2. Referee: [Design decisions section] The section discussing design decisions and challenges does not address how ground truth was obtained in a manner independent of the cleaning methods being benchmarked, nor does it report any cross-validation or external reference checks. This directly affects the weakest assumption that the ground truth is accurate and unbiased for postal addresses.

    Authors: We will revise the design decisions section to explicitly describe the ground truth acquisition process and its independence from the benchmarked cleaning methods. The revision will include details on any cross-validation steps or external reference sources used to confirm accuracy. This directly addresses the concern about potential bias and strengthens the assumption of reliable ground truth. revision: yes

Circularity Check

0 steps flagged

Dataset release with no derivations or fitted results

full rationale

The paper contributes a collection of real-world dirty postal address data together with ground truth and empirical benchmarks of existing cleaners. No equations, predictions, parameters, or derivation chains appear in the manuscript. Claims rest on data collection methodology and direct experimental comparison rather than any self-referential reduction, fitted-input prediction, or self-citation load-bearing step. The ground-truth verification process is a methodological decision external to any mathematical construction and does not create circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data-collection and benchmarking paper with no mathematical model, so the ledger contains no entries.

Reference graph

Works this paper leans on

42 extracted references · 32 canonical work pages · 1 internal anchor

  1. [1]

    Mohamed Abdelaal, Christian Hammacher, and Harald Schöning. 2023. REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines. In Proceedings of the International Conference on Extending Database Technology (EDBT). 499–511. https://doi.org/10.48786/edbt.2023.43

  2. [2]

    Ziawasch Abedjan, Cuneyt G. Akcora, Mourad Ouzzani, Paolo Papotti, and Michael Stonebraker. 2015. Temporal Rules Discovery for Web Data Cleaning. Proceedings of the VLDB Endowment (PVLDB) 9, 4 (2015), 336–347. https://doi.org/10.14778/2856318.2856328

  3. [3]

    Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, and Nan Tang. 2016. Detecting Data Errors: Where are we and what needs to be done? Proceedings of the VLDB Endowment (PVLDB) 9, 12 (2016), 993–1004. https://doi.org/10.14778/2994509.2994518

  4. [4]

    Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renée J. Miller, Paolo Papotti, and Donatello Santoro. 2015. Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms. Proceedings of the VLDB Endowment (PVLDB) 9, 2 (2015), 36–47. https://doi.org/10.14778/2850578.2850579

  5. [5]

    Christopher Barrington-Leigh and Adam Millard-Ball. 2017. The World’s User-Generated Road Map Is More Than 80% Complete. PLOS ONE 12, 8 (2017), e0180698. https://doi.org/10.1371/journal.pone.0180698

  6. [6]

    Divya Bhadauria, Hazar Harmouch, Felix Naumann, Divesh Srivastava, and Lisa Ehrlinger. 2026. A Catalog of Data Errors. CoRR abs/2604.09277 (2026). https://doi.org/10.48550/ARXIV.2604.09277 arXiv:2604.09277

  7. [7]

    Felix Bießmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, and David Salinas. 2019. DataWig: Missing Value Imputation for Tables. Journal of Machine Learning Research (JMLR) 20, Article 175 (2019), 6 pages. https://jmlr.org/papers/v20/18-753.html

  8. [8]

    Alexander Brinkmann, Anna Primpeli, and Christian Bizer. 2023. The Web Data Commons Schema.org Data Set Series. In Companion Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023, Ying Ding, Jie Tang, Juan F. Sequeda, Lora Aroyo, Carlos Castillo, and Geert-Jan Houben (Eds.). ACM, 136–139. https://doi.org/10.1145/...

  9. [9]

    Xu Chu, Ihab F. Ilyas, and Paolo Papotti. 2013. Holistic Data Cleaning: Putting Violations Into Context. In Proceedings of the IEEE International Conference on Data Engineering (ICDE). 458–469. https://doi.org/10.1109/ICDE.2013.6544847

  10. [10]

    Fred Damerau. 1964. A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM (CACM) 7, 3 (1964), 171–176. https://doi.org/10.1145/363958.363994

  11. [11]

    Edith Desiree de Leeuw. 1992. Data Quality in Mail, Telephone and Face to Face Surveys. https://eric.ed.gov/?id=ED374136

  12. [12]

    Xiaoou Ding, Zekai Qian, Hongzhi Wang, Siying Chen, Yafeng Tang, Hongbin Su, Huan Hu, and Chen Wang. 2025. UniClean: A Scalable Data Cleaning Solution for Mixed Errors based on Unified Cleaners and Optimized Cleaning Workflow. Proceedings of the VLDB Endowment (PVLDB) 18, 11 (2025), 4117–4130. https://doi.org/10.14778/3749646.3749681

  13. [13]

    Anna-Christina Glock, Christine Dominka-Kiss, Philipp Korom, and Lisa Ehrlinger. 2025. Detecting and Cleaning Errors in Personal Contact Information with Large Language Models. Proceedings of the VLDB Endowment (PVLDB) (2025)

  14. [14]

    Jean-Nicholas Hould. 2017. Craft Beers Dataset. https://www.kaggle.com/nickhould/craft-cans. Version 1

  15. [15]

    Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. 1999. TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies. Comput. J. 42, 2 (1999), 100–111. https://doi.org/10.1093/COMJNL/42.2.100

  16. [16]

    Philipp Jung, Sebastian Jäger, Nicholas Chandler, and Felix Biessmann. 2025. Towards Realistic Error Models for Tabular Data. ACM J. Data Inf. Qual. 17, 4 (2025), 28:1–28:27. https://doi.org/10.1145/3774914

  17. [17]

    Bojan Karlas, Peng Li, Renzhi Wu, Nezihe Merve Gürel, Xu Chu, Wentao Wu, and Ce Zhang. 2020. Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions. Proceedings of the VLDB Endowment (PVLDB) 14, 3 (2020), 255–267. https://doi.org/10.14778/3430915.3430917

  18. [18]

    Vladimir I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady 10, 8 (1966), 707–710

  19. [19]

    Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, and Ce Zhang. 2021. CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks. In Proceedings of the IEEE International Conference on Data Engineering (ICDE). https://doi.org/10.1109/ICDE51399.2021.00009

  20. [20]

    Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. 2012. Truth Finding on the Deep Web: Is the Problem Solved? Proceedings of the VLDB Endowment (PVLDB) 6, 2 (2012), 97–108. https://doi.org/10.14778/2535568.2448943

  21. [21]

    Yi Li and Gao Cong. 2025. GeoBloom: Revisiting Lightweight Models for Geographic Information Retrieval. Proceedings of the VLDB Endowment (PVLDB) 18, 5 (2025), 1348–1361. https://doi.org/10.14778/3718057.3718064

  22. [22]

    Yiding Liu, Tuan-Anh Nguyen Pham, Gao Cong, and Quan Yuan. 2017. An Experimental Evaluation of Point-of-interest Recommendation in Location-based Social Networks. Proceedings of the VLDB Endowment (PVLDB) 10, 10 (2017), 1010–1021. https://doi.org/10.14778/3115404.3115407

  23. [23]

    Mohammad Mahdavi and Ziawasch Abedjan. 2020. Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning. Proceedings of the VLDB Endowment (PVLDB) 13, 11 (2020), 1948–1961. https://doi.org/10.14778/3407790.3407801

  24. [24]

    Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2019. Raha: A Configuration-Free Error Detection System. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). 865–882. https://doi.org/10.1145/3299869.3324956

  25. [25]

    Sedir Mohammed, Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Sina Noack, Hendrik Patzlaff, Felix Naumann, and Hazar Harmouch. 2025. The effects of data quality on machine learning performance on tabular data. Information Systems 132, Article 102549 (2025), 18 pages. https://doi.org/10.1016/j.is.2025.102549

  26. [26]

    Sedir Mohammed, Lisa Ehrlinger, Hazar Harmouch, Felix Naumann, and Divesh Srivastava. 2025. The Five Facets of Data Quality Assessment. ACM SIGMOD Record 54, 2 (2025), 18–27. https://doi.org/10.1145/3749116.3749120

  27. [27]

    Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Shuwei Liang, and Jianwei Yin. 2024. Automatic Data Repair: Are We Ready to Deploy? Proceedings of the VLDB Endowment (PVLDB) 17, 10 (2024), 2617–2630. https://doi.org/10.14778/3675034.3675051

  28. [28]

    Wei Ni, Kaihang Zhang, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Yaoshu Wang, and Jianwei Yin. 2025. ZeroED: Hybrid Zero-Shot Error Detection Through Large Language Model Reasoning. In 41st IEEE International Conference on Data Engineering, ICDE 2025, Hong Kong, May 19-23, 2025. IEEE, 3126–3139. https://doi.org/10.1109/ICDE65448.2025.00234

  29. [29]

    Mourad Ouzzani, Hossam Hammady, Zbys Fedorowicz, and Ahmed Elmagarmid. 2016. Rayyan—a web and mobile app for systematic reviews. Systematic Reviews 5, Article 210 (2016), 10 pages. https://doi.org/10.1186/s13643-016-0384-4

  30. [30]

    Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic Data Repairs with Probabilistic Inference. Proceedings of the VLDB Endowment (PVLDB) 10, 11 (2017), 1190–1201. https://doi.org/10.14778/3137628.3137631

  31. [31]

    Valerie Restat, Gerrit Boerner, André Conrad, and Uta Störl. 2022. GouDa - Generation of universal Data Sets: Improving Analysis and Evaluation of Data Preparation Pipelines. In Proceedings of the Workshop on Data Management for End-To-End Machine Learning (DEEM). Article 2, 6 pages. https://doi.org/10.1145/3533028.3533311

  32. [32]

    El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood, and Michael Stonebraker. 2021. Horizon: Scalable Dependency-driven Data Cleaning. Proceedings of the VLDB Endowment (PVLDB) 14, 11 (2021), 2546– https://doi.org/10.14778/3476249.3476301

  33. [33]

    Sebastian Schelter, Tammo Rukat, and Felix Biessmann. 2021. JENGA - A Framework to Study the Impact of Data Errors on the Predictions of Machine Learning Models. In Proceedings of the International Conference on Extending Database Technology (EDBT). 529–534. https://doi.org/10.5441/002/edbt.2021.63

  34. [34]

    Shafaq Siddiqi, Roman Kern, and Matthias Boehm. 2023. SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications. Proceedings of the ACM on Management of Data (PACMMOD) 1, 3, Article 218 (2023), 26 pages. https://doi.org/10.1145/3617338

  35. [35]

    Jasmin Singh and Heiko Gebauer. 2024. Clean Customer Master Data for Customer Analytics: A Neglected Element of Data Monetization. Digital 4, 4 (2024), 1020–1038. https://doi.org/10.3390/digital4040051

  36. [36]

    Nishant Subramani, Sasha Luccioni, Jesse Dodge, and Margaret Mitchell. 2023. Detecting Personal Information in Training Corpora: an Analysis. In Proceedings of the Workshop on Trustworthy Natural Language Processing (TrustNLP). 208–220. https://doi.org/10.18653/v1/2023.trustnlp-1.18

  37. [37]

    The Unicode Consortium. 2024. The Unicode Standard, Version 15.1. https://www.unicode.org/versions/Unicode15.1.0/

  38. [38]

    United States Postal Service. 2014. Undeliverable as Addressed Mail. https://www.uspsoig.gov/reports/audit-reports/undeliverable-addressed-mail

  39. [39]

    Yangyang Wu, Chen Yang, Mengying Zhu, Xiaoye Miao, Wei Ni, Meng Xi, Xinkui Zhao, and Jianwei Yin. 2025. A Zero-Training Error Correction System with Large Language Models. In Proceedings of the IEEE International Conference on Data Engineering (ICDE). 2949–2962. https://doi.org/10.1109/ICDE65448.2025.00221