MaDI-Bench: An End-to-End Data Integration Benchmark
Pith reviewed 2026-06-30 03:25 UTC · model grok-4.3
The pith
MaDI-Bench supplies the first public benchmark for complete end-to-end relational data integration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MaDI-Bench contributes a set of base end-to-end data integration tasks spanning several application domains, each requiring the full schema matching, value normalization, entity matching, and conflict resolution pipeline, together with a generic method for deriving task variants that mitigates rapid benchmark saturation as data integration systems advance.
What carries the argument
The Mannheim Data Integration Benchmark consisting of base end-to-end tasks across domains and a generic variant-derivation method that together support evaluation of the full integration process.
Load-bearing premise
The chosen base tasks across domains plus the variant-derivation method are sufficient to represent realistic integration challenges and to keep the benchmark from saturating quickly as systems advance.
What would settle it
New integration systems reach near-perfect end-to-end accuracy across all current variants without needing additional tasks, or rankings on the benchmark diverge sharply from performance on real-world integration projects.
Figures
read the original abstract
Data integration combines heterogeneous data sets into a single, coherent representation. Data integration involves a sequence of interdependent tasks including schema matching, value normalization, entity blocking, entity matching, and data fusion. Existing benchmarks either evaluate these steps in isolation or cover only incomplete versions of the data integration pipeline, omitting specific steps. The lack of public end-to-end data integration benchmarks hinders research on data integration methods that address the integration process as a whole. This paper fills this gap by introducing the Mannheim Data Integration Benchmark (MaDI-Bench), the first benchmark for the end-to-end integration of relational tables covering all steps of the integration process. MaDI-Bench contributes (i) a set of base end-to-end data integration tasks spanning several application domains, each requiring the full schema matching, value normalization, entity matching, and conflict resolution pipeline; and (ii) a generic method for deriving task variants that mitigates rapid benchmark saturation as data integration systems advance. We validate the benchmark using human-engineered pipelines, a best-of-breed pipeline, and an LLM-based pipeline. The validation demonstrates the utility of the benchmark for measuring the step-wise as well as the end-to-end performance of data integration pipelines. All benchmark artifacts are available for public download.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce MaDI-Bench as the first benchmark for end-to-end data integration of relational tables, covering all steps of the integration process including schema matching, value normalization, entity blocking, entity matching, and data fusion. It provides a set of base tasks across domains and a generic method for deriving variants to prevent rapid saturation, validated using human-engineered, best-of-breed, and LLM-based pipelines to show utility for measuring step-wise and end-to-end performance.
Significance. If the benchmark tasks are representative of real-world challenges and the variant method effectively maintains difficulty, this work could provide a valuable standardized resource for advancing research on complete data integration pipelines, addressing the current lack of such benchmarks.
major comments (2)
- [Abstract] Abstract: The central claim that MaDI-Bench fills the gap as the first true end-to-end benchmark rests on the assumption that the chosen base tasks across domains plus the generic variant-derivation method produce instances requiring all pipeline steps in a non-trivial way and remain challenging; the provided text gives no details on task construction, domain selection criteria, or how variants introduce structural/semantic difficulties beyond surface features, leaving the realism and anti-saturation properties unverified.
- [Validation] Validation description: The validation with three pipeline types is asserted to demonstrate utility for step-wise and end-to-end measurement, but without reported quantitative results, task statistics, or exclusion criteria, the support for these claims cannot be assessed from the manuscript.
minor comments (1)
- [Abstract] Abstract: The enumerated steps include 'entity blocking' and 'conflict resolution' but the opening list omits blocking and uses 'data fusion'; ensure consistent terminology for the covered pipeline.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive suggestions. We address each of the major comments below, outlining our planned revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that MaDI-Bench fills the gap as the first true end-to-end benchmark rests on the assumption that the chosen base tasks across domains plus the generic variant-derivation method produce instances requiring all pipeline steps in a non-trivial way and remain challenging; the provided text gives no details on task construction, domain selection criteria, or how variants introduce structural/semantic difficulties beyond surface features, leaving the realism and anti-saturation properties unverified.
Authors: The referee correctly identifies that the manuscript would benefit from more explicit details on task construction and the variant method to substantiate the central claims. We will add a new subsection in the benchmark description that outlines the domain selection criteria, emphasizing diversity in schema complexity, data volume, and heterogeneity across domains. Additionally, we will elaborate on how the generic variant-derivation method introduces structural and semantic difficulties beyond surface-level changes, including examples of non-trivial variations. This will also include preliminary analysis supporting the realism and resistance to saturation. revision: yes
-
Referee: [Validation] Validation description: The validation with three pipeline types is asserted to demonstrate utility for step-wise and end-to-end measurement, but without reported quantitative results, task statistics, or exclusion criteria, the support for these claims cannot be assessed from the manuscript.
Authors: We agree that the validation claims require stronger quantitative backing. We will revise the validation section to include detailed quantitative results for the human-engineered, best-of-breed, and LLM-based pipelines, such as precision, recall, and F1 scores for each integration step and end-to-end. We will also report task statistics (e.g., number of source tables, attributes, and records) and specify exclusion criteria for any data points or pipeline runs. These additions will allow readers to better assess the utility for step-wise and end-to-end measurement. revision: yes
Circularity Check
No circularity: benchmark definition is the explicit contribution
full rationale
The paper defines and releases MaDI-Bench as a set of base end-to-end integration tasks plus a variant-generation procedure. No derivation, equation, or prediction is claimed; the central claim is simply that these artifacts constitute the first public end-to-end benchmark. No self-citations are load-bearing for any result, no parameters are fitted and then relabeled as predictions, and no uniqueness theorem or ansatz is smuggled in. The work is therefore self-contained by construction as a benchmark paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
A. Doan, A. Halevy, and Z. Ives,Principles of Data Integration. Morgan Kaufmann, 2012
2012
-
[2]
X. L. Dong and D. Srivastava,Big data integration. Morgan & Claypool Publishers, 2015
2015
-
[3]
Generic Schema Matching, Ten Years Later,
P. A. Bernstein, J. Madhavan, and E. Rahm, “Generic Schema Matching, Ten Years Later,”Proceedings of the VLDB Endowment, vol. 4, no. 11, pp. 695–701, 2011
2011
-
[4]
An Overview of End-to-End Entity Resolution for Big Data,
V . Christophides, V . Efthymiou, T. Palpanaset al., “An Overview of End-to-End Entity Resolution for Big Data,”ACM Computing Surveys, vol. 53, no. 6, pp. 1–42, 2020
2020
-
[5]
Data Fusion,
J. Bleiholder and F. Naumann, “Data Fusion,”ACM Computing Surveys, vol. 41, no. 1, pp. 1–41, 2009
2009
-
[6]
Automatic End-to-End Data Integration using Large Language Models,
A. Steiner and C. Bizer, “Automatic End-to-End Data Integration using Large Language Models,” 2026
2026
-
[7]
https://api.semanticscholar.org/CorpusID:282389107
Y . Zhu, L. Wang, C. Yanget al., “A Survey of Data Agents: Emerging Paradigm or Overstated Hype?”arXiv preprint arXiv:2510.23587, 2025
-
[8]
X. Zhou, J. He, W. Zhouet al., “A Survey of LLM×DATA,”arXiv preprint arXiv:2505.18458, 2025
-
[9]
Large Language Model-based Data Science Agent: A Survey,
K. Chen, P. Wang, Y . Yuet al., “Large Language Model-based Data Science Agent: A Survey,”arXiv preprint arXiv:2508.02744, 2025
-
[10]
Valentine: Evaluating matching techniques for dataset discovery,
C. Koutras, G. Siachamis, A. Ionescu, K. Psarakis, J. Brons, M. Fragk- oulis, C. Lofi, A. Bonifati, and A. Katsifodimos, “Valentine: Evaluating matching techniques for dataset discovery,” in2021 IEEE 37th Inter- national Conference on Data Engineering (ICDE). IEEE, 2021, pp. 468–479
2021
-
[11]
Deep learning for entity matching: A design space exploration,
S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y . Park, G. Krishnan, R. Deep, E. Arcaute, and V . Raghavendra, “Deep learning for entity matching: A design space exploration,” inProceedings of the 2018 international conference on management of data, 2018, pp. 19–34
2018
-
[12]
Truth discovery with multiple conflicting information providers on the web,
X. Yin, J. Han, and P. S. Yu, “Truth discovery with multiple conflicting information providers on the web,” inProceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp. 1048–1052
2007
-
[13]
TPC-DI: The First Industry Benchmark for Data Integration,
M. Poess, T. Rabl, H.-A. Jacobsenet al., “TPC-DI: The First Industry Benchmark for Data Integration,”Proceedings of the VLDB Endowment, vol. 7, no. 13, pp. 1367–1378, 2014
2014
-
[14]
ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines,
T. Jin, Y . Zhu, and D. Kang, “ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines,”Proceedings of the VLDB Endowment, vol. 19, no. 2, pp. 84–98, 2025
2025
-
[15]
WDC Products: A Multi- Dimensional Entity Matching Benchmark,
R. Peeters, R. C. Der, and C. Bizer, “WDC Products: A Multi- Dimensional Entity Matching Benchmark,” inProceedings of the 27th International Conference on Extending Database Technology, 2024, pp. 22–33
2024
-
[16]
COMA: A System for Flexible Combination of Schema Matching Approaches,
H.-H. Do and E. Rahm, “COMA: A System for Flexible Combination of Schema Matching Approaches,” inProceedings of the 28th International Conference on Very Large Data Bases, 2002, pp. 610–621
2002
-
[17]
Magneto: Combining Small and Large Language Models for Schema Matching,
Y . Liu, E. H. M. Pena, A. Santoset al., “Magneto: Combining Small and Large Language Models for Schema Matching,”Proceedings of the VLDB Endowment, vol. 18, no. 8, pp. 2681–2694, 2025
2025
-
[18]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019, pp. 3982–3992
2019
-
[19]
The Probabilistic Relevance Framework: BM25 and Beyond,
S. Robertson and H. Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond,”Foundations and Trends in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009
2009
-
[20]
SC-Block: Supervised Con- trastive Blocking Within Entity Resolution Pipelines,
A. Brinkmann, R. Shraga, and C. Bizer, “SC-Block: Supervised Con- trastive Blocking Within Entity Resolution Pipelines,” inThe Semantic Web: 21st International Conference, ESWC, 2024, pp. 121–142
2024
-
[21]
Deep Entity Matching with Pre-Trained Language Models,
Y . Li, J. Li, Y . Suharaet al., “Deep Entity Matching with Pre-Trained Language Models,”Proceedings of the VLDB Endowment, vol. 14, no. 1, pp. 50–60, 2020
2020
-
[22]
Magellan: Toward Building Entity Matching Management Systems,
P. Konda, S. Das, P. Suganthan G. C.et al., “Magellan: Toward Building Entity Matching Management Systems,”Proceedings of the VLDB Endowment, vol. 9, no. 12, pp. 1197–1208, 2016
2016
-
[23]
Truth Discovery with Multiple Conflicting Information Providers on the Web,
X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web,” inProceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 1048–1052
2007
-
[24]
A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration,
B. Zhao, B. I. P. Rubinstein, J. Gemmellet al., “A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration,” Proceedings of the VLDB Endowment, vol. 5, no. 6, pp. 550–561, 2012
2012
-
[25]
Integrating Conflicting Data: The Role of Source Dependence,
X. L. Dong, L. Berti-Equille, and D. Srivastava, “Integrating Conflicting Data: The Role of Source Dependence,”Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 550–561, 2009
2009
-
[26]
Truth Discovery by Claim and Source Embedding,
S. Lyu, W. Ouyang, H. Shenet al., “Truth Discovery by Claim and Source Embedding,” inProceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 2183–2186
2017
-
[27]
FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data,
J. Zhu, Y . Mao, L. Chenet al., “FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data,”Proceedings of the VLDB Endowment, vol. 17, no. 6, pp. 1337–1349, 2024
2024
-
[28]
Divergence Measures Based on the Shannon Entropy,
J. Lin, “Divergence Measures Based on the Shannon Entropy,”IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, 1991
1991
-
[29]
On Wasserstein Two- Sample Testing and Related Families of Nonparametric Tests,
A. Ramdas, N. Garc ´ıa Trillos, and M. Cuturi, “On Wasserstein Two- Sample Testing and Related Families of Nonparametric Tests,”Entropy, vol. 19, no. 2, p. 47, 2017
2017
-
[30]
FEBRL: An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface,
P. Christen, “FEBRL: An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 1065–1068
2008
-
[31]
Entity Matching using Large Language Models,
R. Peeters, A. Steiner, and C. Bizer, “Entity Matching using Large Language Models,” inProceedings of the 28th International Conference on Extending Database Technology, 2025, pp. 529–541
2025
-
[32]
Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching,
T. Wang, X. Chen, H. Linet al., “Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 96–109
2025
-
[33]
DIPBench Toolsuite: A Framework for Benchmarking Integration Systems,
M. B ¨ohm, D. Habich, W. Lehneret al., “DIPBench Toolsuite: A Framework for Benchmarking Integration Systems,” inProceedings of the 2008 IEEE 24th International Conference on Data Engineering, 2008, pp. 1596–1599
2008
-
[34]
DIBS: A Data Integration Benchmark Suite,
A. M. Cabrera, C. J. Faber, K. Cepedaet al., “DIBS: A Data Integration Benchmark Suite,” inCompanion of the 2018 ACM/SPEC International Conference on Performance Engineering, 2018, pp. 25–28
2018
-
[35]
Alaska: A Flexible Benchmark for Data Integration Tasks,
V . Crescenzi, A. De Angelis, D. Firmaniet al., “Alaska: A Flexible Benchmark for Data Integration Tasks,”arXiv preprint arXiv:2101.11259, 2021
-
[36]
LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes,
Y . Deng, C. Chai, L. Caoet al., “LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes,”Proceedings of the VLDB Endowment, vol. 17, no. 8, pp. 1925–1938, 2024
1925
-
[37]
Evaluation of Pipelines for Data Integration into Knowledge Graphs
M. Hofer and E. Rahm, “Evaluation of Pipelines for Data Integration into Knowledge Graphs,”arXiv preprint arXiv:2605.22304, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.