Efficient Model Repository for Entity Resolution: Construction, Search, and Integration

Peter Christen; Victor Christen

arxiv: 2412.09355 · v3 · submitted 2024-12-12 · 💻 cs.DB


Pith reviewed 2026-05-23 07:16 UTC · model grok-4.3

classification 💻 cs.DB

keywords entity resolution · model repository · multi-source data · model reuse · feature distribution · data integration · classification models · label efficiency


The pith

MoRER constructs a repository of entity resolution models by clustering tasks through feature distribution analysis, allowing reuse across heterogeneous sources with moderate new labeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MoRER to handle entity resolution across multiple data sources where obtaining labels for every new task is costly. It groups similar resolution problems by examining how their features are distributed, then uses those groups to start models from existing ones instead of building each from zero. Experiments on three multi-source datasets show results comparable to or better than active learning and transfer learning under limited label budgets, and better than self-supervised methods that rely on large pre-trained language models. Against fully supervised transformer models, results are comparable or better, depending on the size of the training set.

Core claim

By analyzing feature distributions to cluster entity resolution tasks, MoRER builds and searches a repository of classification models that can be initialized for new tasks with moderate labeling effort, delivering results on par with or better than label-limited baselines and outperforming self-supervised approaches on multi-source data.

What carries the argument

MoRER, a model repository construction method that clusters ER tasks via feature distribution analysis to support model search, initialization, and reuse.
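As a rough illustration of what "feature distribution analysis" between two ER tasks could look like, the sketch below compares the per-feature similarity values of two tasks with a two-sample Kolmogorov-Smirnov statistic. This page does not say which distribution test the paper uses, so the choice of test, the averaging over shared features, and the names `ks_statistic` and `task_similarity` are all assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def task_similarity(task_a, task_b):
    """Hypothetical aggregate: mean of (1 - KS) over the features both tasks share.

    Each task is a dict mapping a feature name to the list of similarity
    values observed for that feature over the task's record pairs."""
    shared = sorted(set(task_a) & set(task_b))
    if not shared:
        return 0.0
    return sum(1.0 - ks_statistic(task_a[f], task_b[f]) for f in shared) / len(shared)
```

Two tasks whose similarity values are distributed identically score 1.0 under this aggregate, and tasks with non-overlapping value ranges score 0.0.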

If this is right

New multi-source entity resolution problems can be solved with fewer fresh labels by starting from repository models of clustered similar tasks.
Model repository search replaces repeated full training or large-scale self-supervision for each incoming data source combination.
Performance remains competitive with active learning and transfer learning while exceeding self-supervised pre-trained language model approaches on the tested datasets.
Results are comparable to or better than supervised transformer methods, depending on the size of the training set.
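The repository search mentioned in the list above can be pictured as a nearest-cluster lookup. The following is a minimal sketch under stated assumptions: each cluster is summarized by a pooled list of similarity values in [0, 1], distance is the total variation distance between histograms, and a new task with no sufficiently close cluster returns `None` (meaning a new model would be needed). None of these choices, nor the names `histogram_distance` and `search_repository`, come from the paper.

```python
def histogram(values, bins=10):
    """Normalized histogram of similarity values assumed to lie in [0, 1]."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]


def histogram_distance(values_a, values_b, bins=10):
    """Total variation distance between two histograms, in [0, 1]."""
    ha, hb = histogram(values_a, bins), histogram(values_b, bins)
    return 0.5 * sum(abs(x - y) for x, y in zip(ha, hb))


def search_repository(repository, new_values, max_distance=0.3):
    """Return (cluster_id, distance) of the closest cluster, or None.

    repository: dict mapping a cluster id to that cluster's pooled
    similarity values (a hypothetical summary of the cluster)."""
    best = None
    for cluster_id, reference_values in repository.items():
        d = histogram_distance(reference_values, new_values)
        if best is None or d < best[1]:
            best = (cluster_id, d)
    if best is not None and best[1] <= max_distance:
        return best
    return None
```

Returning `None` rather than the least-bad cluster keeps the "reuse" and "train a new model" paths separate, which is a design choice of this sketch.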

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same clustering step could be applied to other record-linkage or deduplication settings outside the three datasets examined.
Repository growth over time might allow incremental addition of new models without rebuilding clusters from scratch.
Integration into existing data integration pipelines could lower total labeling cost when sources arrive sequentially.

Load-bearing premise

Feature distribution analysis can reliably group similar entity resolution tasks so that models from one task initialize usefully for another across different data sources.
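To make the grouping step concrete, here is a small sketch that turns pairwise task similarities into clusters. The paper passages quoted further down this page mention an ER problem similarity graph clustered with the Leiden algorithm; this sketch deliberately swaps in a much simpler stand-in (connected components over edges above a threshold, via union-find), and the threshold value and function name are assumptions.

```python
def cluster_tasks(similarities, threshold=0.8):
    """Group ER tasks into clusters.

    similarities: dict mapping (task_a, task_b) -> similarity in [0, 1].
    Stand-in for community detection: two tasks share a cluster when they are
    connected by a chain of edges whose similarity is >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), sim in similarities.items():
        root_a, root_b = find(a), find(b)  # registers both tasks
        if sim >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for task in list(parent):
        clusters.setdefault(find(task), set()).add(task)
    return sorted(sorted(members) for members in clusters.values())
```

With similarities of 0.9 between `p12` and `p13` and low values to `p45`, this yields one two-task cluster and one singleton. Whether such clusters track actual model transferability is exactly the premise stated above, and the sketch does not test it.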

What would settle it

A controlled test on new multi-source datasets where models started from the clustered repository show no accuracy gain over models trained from random initialization with the same labeling budget would falsify the core reuse benefit.
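A minimal sketch of how such a test could be scored, assuming F1 is the metric and that each dataset yields one F1 value for repository-initialized models and one for randomly initialized models under the same labeling budget. The function names and the pairing of results per dataset are assumptions for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 for binary match (truthy) / non-match (falsy) labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_f1_gain(paired_results):
    """paired_results: list of (f1_repository_init, f1_random_init), one pair
    per dataset, both obtained with the same labeling budget."""
    if not paired_results:
        raise ValueError("need at least one paired result")
    gains = [reuse - scratch for reuse, scratch in paired_results]
    return sum(gains) / len(gains)
```

A mean gain near zero across new datasets would be the outcome described above; a proper study would also need repeated runs and a significance test, which this sketch omits.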

Figures

Figures reproduced from arXiv: 2412.09355 by Peter Christen, Victor Christen.

**Figure 1.** Motivation of reusing solved ER tasks for new tasks. The data sources $D_1$ and $D_2$ are already linked utilizing similarity feature vectors and a model $M_{1,2}$ to label each record pair. The question is whether the derived model $M_{1,2}$ can also be applied to the new data source, $D_3$, to match it to $D_1$ and $D_2$, or if new models have to be generated.

**Figure 2.** Example of the similarity distributions using the …

**Figure 3.** Workflow for initializing and using an ER model repository, consisting of the steps: 1. Similarity Distribution Analysis, 2. ER Problem Clustering, 3. Model Generation, 4. Process new ER problems, 5. Classification.

**Figure 4.** Example of integrating a new ER problem $p_{3,5}$. The grey colored ER problems represent problems of $T$.

**Figure 5.** Linkage quality comparison of MoRER to Almser standalone, Sudowoodo, AnyMatch, TransER, and Ditto. Plot text recovered from the extraction: panels Dexter, WDC-computer, and Music; x-axis Budget (0, 1K, 1.5K, 2K, 50%, all); y-axis Runtime(s) on a log scale; legend MoRER+Bootstrap, MoRER+Almser, MoRER, Almser, MultiEM, Sudowoodo, AnyMatch, TransER, Ditto.

**Figure 6.** Runtime comparison of MoRER to Almser standalone, TransER, AnyMatch, Sudowoodo, and Ditto.

**Figure 7.** Comparison of the distribution tests using the AL methods

**Figure 8.** Comparison of the selection strategies $sel_{base}$ and $sel_{cov}$ using Bootstrap AL (b=1000).

Original abstract

Entity resolution (ER) is a fundamental task in data integration that enables insights from heterogeneous data sources. The primary challenge of ER lies in classifying record pairs as matches or nonmatches, which in multi-source ER (MS-ER) scenarios can become complicated due to data source heterogeneity and scalability issues. Existing methods for MS-ER generally require labeled record pairs, and such methods fail to effectively reuse models across multiple ER tasks. We propose MoRER (Model Repositories for Entity Resolution), a novel method for building a model repository consisting of classification models that solve ER problems. By leveraging feature distribution analysis, MoRER clusters similar ER tasks, thereby enabling the effective initialization of a model repository with a moderate labeling effort. Experimental results on three multi-source datasets demonstrate that MoRER achieves comparable or better results to methods that have label-limited budgets, such as active learning and transfer learning approaches, while outperforming self-supervised approaches that utilize large pre-trained language models. When compared to supervised transformer-based methods, MoRER achieves comparable or better results, depending on the size of the training data set used.
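For readers unfamiliar with the classification step the abstract refers to, the sketch below shows one common way a record pair becomes a similarity feature vector that a model then labels as match or non-match. The attribute names, the two similarity functions, and the threshold rule standing in for a trained classifier are all hypothetical; the abstract does not specify the paper's features or models.

```python
from difflib import SequenceMatcher


def token_jaccard(a, b):
    """Jaccard overlap of lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_vector(rec_a, rec_b, attributes):
    """Two similarity features per attribute; a missing value contributes 0.0."""
    vector = []
    for attr in attributes:
        value_a, value_b = rec_a.get(attr) or "", rec_b.get(attr) or ""
        if not value_a or not value_b:
            vector.extend([0.0, 0.0])
            continue
        vector.append(SequenceMatcher(None, value_a.lower(), value_b.lower()).ratio())
        vector.append(token_jaccard(value_a, value_b))
    return vector


def classify(vector, threshold=0.7):
    """Placeholder for a trained model: match if the mean similarity is high."""
    if not vector:
        raise ValueError("empty feature vector")
    return sum(vector) / len(vector) >= threshold
```

In a repository setting, `classify` would be replaced by whichever stored model is selected for the new pair of data sources.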

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MoRER clusters ER tasks by feature distributions to enable model reuse with moderate labeling, but provides no direct check that those clusters track actual transfer performance.


The paper's main move is to build a repository of ER classifiers by clustering tasks on feature distributions, then using the clusters to initialize models for new sources with limited new labels. This targets the practical pain of repeated labeling in multi-source entity resolution, where heterogeneity makes standard supervised or self-supervised routes expensive. The experiments on three multi-source datasets report results that match or beat active learning and transfer learning baselines, beat self-supervised pre-trained language model (PLM) methods, and are comparable to or better than supervised transformers depending on the size of the training set. That framing is useful for the ER subfield even if the gains are incremental rather than transformative. The approach is straightforward and the abstract states the claims clearly.

The central assumption, that feature-distribution clusters reliably indicate where models will transfer, receives no quantitative support in the reported work. No silhouette scores, no correlation between distribution distance and cross-task F1, and no comparison of clustering against a performance-based similarity measure appear. On heterogeneous multi-source data this gap matters: if the clusters are driven by superficial overlap rather than ER-relevant similarity, the initialization benefit collapses to ordinary supervised training plus overhead. The abstract also gives no dataset details, metrics, or statistical tests, so the strength of the empirical claims cannot be judged from the summary alone.

Readers working on label-efficient multi-source ER will find the repository idea worth examining. The paper is coherent on its own terms and engages the right literature, so it clears the bar for peer review despite the validation shortfall on the clustering step. I would send it out.

Referee Report

2 major / 1 minor

Summary. The paper proposes MoRER, a method to construct a model repository for multi-source entity resolution (ER) by using feature distribution analysis to cluster similar ER tasks. This clustering is intended to support effective model initialization and reuse across tasks with only moderate labeling effort. The central experimental claim is that MoRER achieves results comparable or superior to active learning and transfer learning baselines (which also operate under label budgets), outperforms self-supervised methods that rely on large pre-trained language models, and matches supervised transformer-based methods depending on training set size, all demonstrated on three multi-source datasets.

Significance. If the clustering step reliably identifies transferable models, the approach could meaningfully lower labeling costs in heterogeneous multi-source ER settings, a practical bottleneck in data integration. The repository concept itself is a reasonable engineering response to repeated ER tasks, but the significance hinges entirely on whether feature-distribution clusters align with actual transfer performance rather than superficial similarity.

major comments (2)

[Abstract] The claim of 'comparable or better results' on three multi-source datasets is presented without any description of the datasets, evaluation metrics (e.g., F1), baselines, number of runs, error bars, or statistical tests. Because the entire contribution rests on these experimental comparisons rather than a derivation, the absence of these details prevents assessment of whether the results actually support the moderate-labeling claim.
[Method (feature distribution analysis)] The central assumption that clusters derived from feature distributions group ER tasks by transferability (thereby enabling effective model reuse with moderate labels) receives no quantitative validation. No silhouette score, adjusted Rand index against a performance-based similarity matrix, or correlation between distribution distance and cross-task F1 is reported. On heterogeneous multi-source data this check is load-bearing; if the clusters are spurious, the initialization benefit reduces to standard supervised training plus overhead.
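The correlation check requested in the second comment could be run along the following lines. This is a minimal sketch of a Spearman rank correlation in plain Python, assuming one has, for each pair of ER tasks, a distribution distance and the F1 obtained when a model from one task is applied to the other. The numbers in the usage note are made up for illustration and are not results from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, with illustrative distances `[0.1, 0.2, 0.4, 0.7]` and cross-task F1 values `[0.95, 0.90, 0.70, 0.40]`, the correlation is strongly negative, which is the pattern one would hope to see if distribution distance tracks transferability.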

minor comments (1)

[Abstract] The qualifier 'depending on the size of the training data set used' for the supervised-transformer comparison is imprecise and should be replaced by concrete sizes or a figure reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and commit to revisions that directly incorporate the suggested improvements.


Referee: [Abstract] The claim of 'comparable or better results' on three multi-source datasets is presented without any description of the datasets, evaluation metrics (e.g., F1), baselines, number of runs, error bars, or statistical tests. Because the entire contribution rests on these experimental comparisons rather than a derivation, the absence of these details prevents assessment of whether the results actually support the moderate-labeling claim.

Authors: We agree that the abstract would benefit from additional context on the experimental setup to allow proper assessment of the claims. In the revised manuscript, we will expand the abstract to include brief descriptions of the three multi-source datasets, specify F1-score as the primary metric, enumerate the baseline categories (active learning, transfer learning, self-supervised methods with pre-trained language models, and supervised transformer-based methods), and note that results are reported as averages over multiple runs with error bars (as detailed in the experimental section). Revision: yes.

Referee: [Method (feature distribution analysis)] The central assumption that clusters derived from feature distributions group ER tasks by transferability (thereby enabling effective model reuse with moderate labels) receives no quantitative validation. No silhouette score, adjusted Rand index against a performance-based similarity matrix, or correlation between distribution distance and cross-task F1 is reported. On heterogeneous multi-source data this check is load-bearing; if the clusters are spurious, the initialization benefit reduces to standard supervised training plus overhead.

Authors: We acknowledge that the manuscript does not provide direct quantitative validation (such as silhouette scores or correlation between distribution distances and cross-task F1) linking the feature-distribution clusters to transferability. While the end-to-end results support the overall approach, we agree that explicit checks are needed to substantiate the clustering assumption on heterogeneous data. In the revision, we will add an analysis subsection reporting the correlation between pairwise feature distribution distances and observed cross-task F1 differences, along with silhouette scores for the derived clusters. Revision: yes.
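For reference, the silhouette score mentioned in this exchange can be computed directly from a pairwise distance matrix between ER tasks. The sketch below is a plain-Python version under the usual definition, with singleton clusters scored as 0 by convention; the matrix layout and function name are assumptions, and it says nothing about what scores the paper's clusters would obtain.

```python
def silhouette_score(distance, labels):
    """Mean silhouette over all items.

    distance: square matrix (list of lists), distance[i][j] between items i and j.
    labels:   cluster id for each item.
    For item i, a = mean distance to its own cluster, b = smallest mean distance
    to any other cluster, and the silhouette is (b - a) / max(a, b)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(distance[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(distance[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters. A high silhouette alone would still not show that the clusters predict transfer performance, which is why the referee also asks for the correlation with cross-task F1.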

Circularity Check

0 steps flagged

No significant circularity; empirical method evaluated experimentally


The paper proposes MoRER as an empirical approach to building model repositories for entity resolution via feature distribution analysis for task clustering, followed by experimental validation on three multi-source datasets. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or method description. Performance claims rest on direct comparisons to baselines (active learning, transfer learning, self-supervised PLM methods, supervised transformers) rather than any self-referential construction. The clustering step is presented as an input assumption whose effectiveness is tested externally via results, not defined in terms of the target transferability metric. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no specific free parameters, axioms, or invented entities identifiable.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "By leveraging feature distribution analysis, MoRER clusters similar ER tasks... We construct an ER problem similarity graph GP... cluster the graph GP using the Leiden algorithm"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: "We utilize the determined aggregated similarity sim_p between ER problems of PI to build an entity resolution problem graph GP... partition the graph into multiple clusters of similar ER problems"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 1 internal anchor

[1] Mikhail Bilenko and Raymond J. Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In ACM SIGKDD. Washington DC, 39–48.

[2] Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Architectures - A Step Forward in Data Integration. In Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020, Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Böhm, Dan Olteanu,... doi:10.5441/002/edbt.2020.58

[3] Peter Christen. 2012. Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer. https://doi.org/10.1007/978-3-642-31164-2

[4] Victor Christen, Peter Christen, and Erhard Rahm. 2019. Informativeness-Based Active Learning for Entity Resolution. In Machine Learning and Knowledge Discovery in Databases - International Workshops of ECML PKDD (Communications in Computer and Information Science), Vol. 1168. Springer, 125–141. https://doi.org/10.1007/978-3-030-43887-6_11

[5] Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, and Kostas Stefanidis. 2021. An Overview of End-to-End Entity Resolution for Big Data. ACM Comput. Surv. 53, 6 (2021), 127:1–127:42. https://doi.org/10.1145/3418896

[6] Fernando de Meer Pardo, Claude Lehmann, Dennis Gehrig, Andrea Nagy, Stefano Nicoli, Branka Hadji Misheva, Martin Braschler, and Kurt Stockinger. 2025. GraLMatch: Matching Groups of Entities with Graphs and Language Models. In Proceedings 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 2025, Alkis ... doi:10.48786/edbt.2025.01

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, ... doi:10.18653/v1/n19-1423

[8] AnHai Doan, Pradap Konda, Paul Suganthan G. C., Yash Govind, Derek Paulsen, Kaushik Chandrasekhar, Philip Martinkus, and Matthew Christie. 2020. Magellan: toward building ecosystems of entity matching solutions. Commun. ACM 63, 8 (2020), 83–91. https://doi.org/10.1145/3405476

[9] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq R. Joty, Mourad Ouzzani, and Nan Tang. 2017. DeepER - Deep Entity Resolution. CoRR abs/1710.00597 (2017). arXiv:1710.00597 http://arxiv.org/abs/1710.00597

[10] M. Girvan and M. E. J. Newman. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 12 (June 2002), 7821–7826. https://doi.org/10.1073/pnas.122653799

[11] Rihan Hai, Christos Koutras, Christoph Quix, and Matthias Jarke. 2023. Data Lakes: A Survey of Functions and Systems. IEEE Trans. Knowl. Data Eng. 35, 12 (2023), 12571–12590. https://doi.org/10.1109/TKDE.2023.3270101

[12] Kai Hildebrandt, Fabian Panse, Niklas Wilcke, and Norbert Ritter. 2020. Large-Scale Data Pollution with Apache Spark. IEEE Trans. Big Data 6, 2 (2020), 396–411. https://doi.org/10.1109/TBDATA.2016.2637378

[13] Myles Hollander, Douglas A. Wolfe, and Eric Chicken. 2013. Nonparametric Statistical Methods (3rd ed.). John Wiley & Sons.

[14] Di Jin, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Danai Koutra. 2021. Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation. Proc. VLDB Endow. 15, 3 (2021), 465–477. https://doi.org/10.14778/3494124.3494131

[15] Nishadi Kirielle, Peter Christen, and Thilina Ranbaduge. 2022. TransER: Homogeneous Transfer Learning for Entity Resolution. In Proceedings of the 25th International Conference on Extending Database Technology, EDBT 2022, Edinburgh, UK, March 29 - April 1, 2022. OpenProceedings.org, 2:118–2:130. https://doi.org/10.48786/EDBT.2022.03

[16] Pradap Konda, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeffrey F. Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. 2016. Magellan: Toward Building Entity Matching Management Systems. Proc. VLDB Endow. 9, 12 (2016), 1197–1208. https://doi.org/10.14778/29... doi:10.14778/2994509.2994535

[17] Hannah Köpcke, Andreas Thor, and Erhard Rahm. 2010. Learning-Based Approaches for Matching Web Data Entities. IEEE Internet Computing 14, 4 (2010), 23–31. https://doi.org/10.1109/MIC.2010.58

[18] Stefan Lerm, Alieh Saeedi, and Erhard Rahm. 2021. Extended Affinity Propagation Clustering for Multi-source Entity Resolution. In Datenbanksysteme für Business, Technologie und Web (BTW 2021), 19. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme" (DBIS), 13.-17. September 2021, Dresden, Germany, Proceedings (LNI), Kai-Uwe Sattler, Me... doi:10.18420/btw2021-11

[19, 20] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep Entity Matching with Pre-Trained Language Models. Proc. VLDB Endow. 14, 1 (2020), 50–60. https://doi.org/10.14778/3421424.3421431

[21] Robin Linacre, Sam Lindsay, Theodore Manassis, Zoe Slade, Tom Hepworth, Ross Kennedy, and Andrew Bond. 2022. Splink: Free software for probabilistic record linkage at scale. International Journal of Population Data Science 7, 3 (Aug. 2022). https://doi.org/10.23889/ijpds.v7i3.1794

[22] Michael Loster, Ioannis K. Koumarelas, and Felix Naumann. 2021. Knowledge Transfer for Entity Resolution with Siamese Neural Networks. ACM J. Data Inf. Qual. 13, 1 (2021), 2:1–2:25. https://doi.org/10.1145/3410157

[23] Jakub Maciejewski, Konstantinos Nikoletos, George Papadakis, and Yannis Velegrakis. 2025. Progressive Entity Matching: A Design Space Exploration. Proc. ACM Manag. Data 3, 1, Article 65 (Feb. 2025), 25 pages. https://doi.org/10.1145/3709715

[24] Barzan Mozafari, Purna Sarkar, Michael Franklin, Michael Jordan, and Samuel Madden. 2014. Scaling Up Crowd-sourcing to Very Large Datasets: A Case for Active Learning. PVLDB Endowment 8, 2 (Oct. 2014), 125–136.

[25] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD ’18). Association for Computing Machinery, N... doi:10.1145/3183713.3196926

[26] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, Gautam Da... doi:10.1145/3183713

[27] Axel-Cyrille Ngonga Ngomo and Klaus Lyko. 2012. EAGLE: Efficient Active Learning of Link Specifications Using Genetic Programming. In The Semantic Web: Research and Applications. Berlin, Heidelberg, 149–163.

[28] S. Padmanabhan, L. Carty, E. Cameron, R. E. Ghosh, R. Williams, and H. Strongman. 2019. Approach to record linkage of primary care data from Clinical Practice Research Datalink to other health-related patient data: overview and implications. European Journal of Epidemiology 34, 1 (Jan 2019), 91–99. https://doi.org/10.1007/s10654-018-0442-4 Epub 2018 Sep 15.

[29, 30] George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2021. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 53, 2 (2021), 31:1–31:42. https://doi.org/10.1145/3377455

[31] Ralph Peeters and Christian Bizer. 2023. Using ChatGPT for Entity Matching. In New Trends in Database and Information Systems - ADBIS 2023 (Communications in Computer and Information Science), Vol. 1850. Springer, 221–230. https://doi.org/10.1007/978-3-031-42941-5_20

[32] Anna Primpeli and Christian Bizer. 2021. Graph-Boosted Active Learning for Multi-source Entity Resolution. In The Semantic Web - ISWC 2021 - 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24-28, 2021, Proceedings (Lecture Notes in Computer Science), Vol. 12922. Springer, 182–199. https://doi.org/10.1007/978-3-030-88361-4_11

[33] Anna Primpeli, Ralph Peeters, and Christian Bizer. 2019. The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. In Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, Sihem Amer-Yahia, Mohammad Mahdian, Ashish Goel, Geert-Jan Houben, Kristina Lerman, Julian J. McAuley, Ricardo Baeza-Yate... doi:10.1145/3308560.3316609

[34] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. 2007. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76 (Sep 2007), 036106. Issue 3. https://doi.org/10.1103/PhysRevE.76.036106

[35] Alieh Saeedi, Lucie David, and Erhard Rahm. 2021. Matching Entities from Multiple Sources with Hierarchical Agglomerative Clustering. In IC3K. SCITEPRESS, 40–50. https://doi.org/10.5220/0010649600003064

[36] Alieh Saeedi, Eric Peukert, and Erhard Rahm. 2018. Using Link Features for Entity Clustering in Knowledge Graphs. In The Semantic Web - 15th International Conference, ESWC 2018, Proceedings (Lecture Notes in Computer Science), Vol. 10843. Springer, 576–592. https://doi.org/10.1007/978-3-319-93417-4_37

[37] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019). arXiv:1910.01108 http://arxiv.org/abs/1910.01108

[38] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 11 (1997), 2673–2681. https://doi.org/10.1109/78.650093

[39] Naeem Siddiqi. 2005. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.

[40] Michael Stonebraker and Ihab F. Ilyas. 2018. Data Integration: The Current Status and the Way Forward. IEEE Data Eng. Bull. 41, 2 (2018), 3–9.

[41] Saravanan Thirumuruganathan, Han Li, Nan Tang, Mourad Ouzzani, Yash Govind, Derek Paulsen, Glenn Fung, and AnHai Doan. 2021. Deep Learning for Blocking in Entity Matching: A Design Space Exploration. Proc. VLDB Endow. 14, 11 (2021), 2459–2472. https://doi.org/10.1477... doi:10.14778/3476249.3476294

EDBT 2026, 24th March-27th March, 2026, Tampere, Finland. Victor Christen and Peter Christen

[42] V. A. Traag, L. Waltman, and N. J. van Eck. 2019. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports 9, 1 (March 2019). https://doi.org/10.1038/s41598-019-41695-z

[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxbur...

[44] Cédric Villani. 2009. Optimal Transport: Old and New. Springer-Verlag Berlin Heidelberg. https://doi.org/10.1007/978-3-540-71050-9

[45] Qing Wang, Dinusha Vatsalan, and Peter Christen. 2015. Efficient Interactive Training Selection for Large-Scale Entity Resolution. In PAKDD. Ho Chi Minh City, Vietnam.

[46]

Runhui Wang, Yuliang Li, and Jin Wang. 2023. Sudowoodo: Contrastive Self- supervised Learning for Multi-purpose Data Integration and Preparation. In39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3-7, 2023. IEEE, 1502–1515. https://doi.org/10.1109/ICDE55515.2023.00391

work page doi:10.1109/icde55515.2023.00391 2023
[47]

Timothé Watteau, Aubin Bonnefoy, Simon Illouz-Laurent, Joaquim Jusseau, and Serge Iovleff. 2024. Advanced Graph Clustering Methods: A Comprehensive and In-Depth Analysis. arXiv:2407.09055 [stat.ML] https://arxiv.org/abs/2407.09055

[48]

Renzhi Wu, Sanya Chaba, Saurabh Sawlani, Xu Chu, and Saravanan Thirumuruganathan. 2020. ZeroER: Entity Resolution using Zero Labeled Examples. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Ta...

[49]

Xiaocan Zeng, Pengfei Wang, Yuren Mao, Lu Chen, Xiaoze Liu, and Yunjun Gao. 2024. MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching. In 40th IEEE International Conference on Data Engineering, ICDE 2024, Utrecht, The Netherlands, May 13-16, 2024. IEEE, 3421–3434. https://doi.org/10.1109/ICDE60146.2024.00264

[50]

Zeyu Zhang, Paul Groth, Iacer Calixto, and Sebastian Schelter. 2025. A Deep Dive Into Cross-Dataset Entity Matching with Large and Small Language Models. In Proceedings 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 2025, Alkis Simitsis, Bettina Kemme, Anna Queralt, Oscar Romero, and Petar Jovanovi...

doi:10.48786/edbt.2025.75

[51]

Chen Zhao and Yeye He. 2019. Auto-EM: End-to-end Fuzzy Entity-Matching using Pre-trained Deep Models and Transfer Learning. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, Ling Liu, Ryen W. White, Amin Mantrach, Fabrizio Silvestri, Julian J. McAuley, Ricardo Baeza-Yates, and Leila Zia (Eds.). ACM, 2413–2424. https://doi...

doi:10.1145/3308558.3313578

[52]

Xiaojin Zhu and Zoubin Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107. Carnegie Mellon University.

[1]

Mikhail Bilenko and Raymond J. Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In ACM SIGKDD. Washington DC, 39–48.

[2]

Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Architectures - A Step Forward in Data Integration. In Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020, Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Böhm, Dan Olteanu,...

doi:10.5441/002/edbt.2020.58

[3]

Peter Christen. 2012. Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer. https://doi.org/10.1007/978-3-642-31164-2

[4]

Victor Christen, Peter Christen, and Erhard Rahm. 2019. Informativeness-Based Active Learning for Entity Resolution. In Machine Learning and Knowledge Discovery in Databases - International Workshops of ECML PKDD (Communications in Computer and Information Science), Vol. 1168. Springer, 125–141. https://doi.org/10.1007/978-3-030-43887-6_11

[5]

Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, and Kostas Stefanidis. 2021. An Overview of End-to-End Entity Resolution for Big Data. ACM Comput. Surv. 53, 6 (2021), 127:1–127:42. https://doi.org/10.1145/3418896

[6]

Fernando de Meer Pardo, Claude Lehmann, Dennis Gehrig, Andrea Nagy, Stefano Nicoli, Branka Hadji Misheva, Martin Braschler, and Kurt Stockinger. 2025. GraLMatch: Matching Groups of Entities with Graphs and Language Models. In Proceedings 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 2025, Alkis ...

doi:10.48786/edbt.2025.01

[7]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, ...

doi:10.18653/v1/n19-1423

[8]

AnHai Doan, Pradap Konda, Paul Suganthan G. C., Yash Govind, Derek Paulsen, Kaushik Chandrasekhar, Philip Martinkus, and Matthew Christie. 2020. Magellan: toward building ecosystems of entity matching solutions. Commun. ACM 63, 8 (2020), 83–91. https://doi.org/10.1145/3405476

[9]

Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq R. Joty, Mourad Ouzzani, and Nan Tang. 2017. DeepER - Deep Entity Resolution. CoRR abs/1710.00597 (2017). arXiv:1710.00597 http://arxiv.org/abs/1710.00597

[10]

M. Girvan and M. E. J. Newman. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 12 (June 2002), 7821–7826. https://doi.org/10.1073/pnas.122653799

[11]

Rihan Hai, Christos Koutras, Christoph Quix, and Matthias Jarke. 2023. Data Lakes: A Survey of Functions and Systems. IEEE Trans. Knowl. Data Eng. 35, 12 (2023), 12571–12590. https://doi.org/10.1109/TKDE.2023.3270101

[12]

Kai Hildebrandt, Fabian Panse, Niklas Wilcke, and Norbert Ritter. 2020. Large-Scale Data Pollution with Apache Spark. IEEE Trans. Big Data 6, 2 (2020), 396–411. https://doi.org/10.1109/TBDATA.2016.2637378

[13]

Myles Hollander, Douglas A. Wolfe, and Eric Chicken. 2013. Nonparametric Statistical Methods (3rd ed.). John Wiley & Sons.

[14]

Di Jin, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Danai Koutra. 2021. Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation. Proc. VLDB Endow. 15, 3 (2021), 465–477. https://doi.org/10.14778/3494124.3494131

[15]

Nishadi Kirielle, Peter Christen, and Thilina Ranbaduge. 2022. TransER: Homogeneous Transfer Learning for Entity Resolution. In Proceedings of the 25th International Conference on Extending Database Technology, EDBT 2022, Edinburgh, UK, March 29 - April 1, 2022. OpenProceedings.org, 2:118–2:130. https://doi.org/10.48786/EDBT.2022.03

[16]

Pradap Konda, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeffrey F. Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. 2016. Magellan: Toward Building Entity Matching Management Systems. Proc. VLDB Endow. 9, 12 (2016), 1197–1208. https://doi.org/10.14778/29...

doi:10.14778/2994509.2994535

[17]

Hannah Köpcke, Andreas Thor, and Erhard Rahm. 2010. Learning-Based Approaches for Matching Web Data Entities. IEEE Internet Computing 14, 4 (2010), 23–31. https://doi.org/10.1109/MIC.2010.58

[18]

Stefan Lerm, Alieh Saeedi, and Erhard Rahm. 2021. Extended Affinity Propagation Clustering for Multi-source Entity Resolution. In Datenbanksysteme für Business, Technologie und Web (BTW 2021), 19. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme" (DBIS), 13.-17. September 2021, Dresden, Germany, Proceedings (LNI), Kai-Uwe Sattler, Me...

doi:10.18420/btw2021-11

[19]

Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep Entity Matching with Pre-Trained Language Models. Proc. VLDB Endow. 14, 1 (2020), 50–60. https://doi.org/10.14778/3421424.3421431

[21]

Robin Linacre, Sam Lindsay, Theodore Manassis, Zoe Slade, Tom Hepworth, Ross Kennedy, and Andrew Bond. 2022. Splink: Free software for probabilistic record linkage at scale. International Journal of Population Data Science 7, 3 (Aug. 2022). https://doi.org/10.23889/ijpds.v7i3.1794

[22]

Michael Loster, Ioannis K. Koumarelas, and Felix Naumann. 2021. Knowledge Transfer for Entity Resolution with Siamese Neural Networks. ACM J. Data Inf. Qual. 13, 1 (2021), 2:1–2:25. https://doi.org/10.1145/3410157

[23]

Jakub Maciejewski, Konstantinos Nikoletos, George Papadakis, and Yannis Velegrakis. 2025. Progressive Entity Matching: A Design Space Exploration. Proc. ACM Manag. Data 3, 1, Article 65 (Feb. 2025), 25 pages. https://doi.org/10.1145/3709715

[24]

Barzan Mozafari, Purna Sarkar, Michael Franklin, Michael Jordan, and Samuel Madden. 2014. Scaling Up Crowd-sourcing to Very Large Datasets: A Case for Active Learning. PVLDB Endowment 8, 2 (Oct. 2014), 125–136.

[25]

Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD ’18). Association for Computing Machinery, N...

doi:10.1145/3183713.3196926

[26]

Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, Gautam Da...

[27]

Axel-Cyrille Ngonga Ngomo and Klaus Lyko. 2012. EAGLE: Efficient Active Learning of Link Specifications Using Genetic Programming. In The Semantic Web: Research and Applications. Berlin, Heidelberg, 149–163.

[28]

S. Padmanabhan, L. Carty, E. Cameron, R. E. Ghosh, R. Williams, and H. Strongman. 2019. Approach to record linkage of primary care data from Clinical Practice Research Datalink to other health-related patient data: overview and implications. European Journal of Epidemiology 34, 1 (Jan 2019), 91–99. https://doi.org/10.1007/s10654-018-0442-4 Epub 2018 Sep 15.

[29]

George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2021. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 53, 2 (2021), 31:1–31:42. https://doi.org/10.1145/3377455

[31]

Ralph Peeters and Christian Bizer. 2023. Using ChatGPT for Entity Matching. In New Trends in Database and Information Systems - ADBIS 2023 (Communications in Computer and Information Science), Vol. 1850. Springer, 221–230. https://doi.org/10.1007/978-3-031-42941-5_20

[32]

Anna Primpeli and Christian Bizer. 2021. Graph-Boosted Active Learning for Multi-source Entity Resolution. In The Semantic Web - ISWC 2021 - 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24-28, 2021, Proceedings (Lecture Notes in Computer Science), Vol. 12922. Springer, 182–199. https://doi.org/10.1007/978-3-030-88361-4_11

[33]

Anna Primpeli, Ralph Peeters, and Christian Bizer. 2019. The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. In Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, Sihem Amer-Yahia, Mohammad Mahdian, Ashish Goel, Geert-Jan Houben, Kristina Lerman, Julian J. McAuley, Ricardo Baeza-Yate...

doi:10.1145/3308560.3316609

[34]

Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. 2007. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76 (Sep 2007), 036106. Issue 3. https://doi.org/10.1103/PhysRevE.76.036106

[35]

Alieh Saeedi, Lucie David, and Erhard Rahm. 2021. Matching Entities from Multiple Sources with Hierarchical Agglomerative Clustering. In IC3K. SCITEPRESS, 40–50. https://doi.org/10.5220/0010649600003064

[36]

Alieh Saeedi, Eric Peukert, and Erhard Rahm. 2018. Using Link Features for Entity Clustering in Knowledge Graphs. In The Semantic Web - 15th International Conference, ESWC 2018, Proceedings (Lecture Notes in Computer Science), Vol. 10843. Springer, 576–592. https://doi.org/10.1007/978-3-319-93417-4_37

[37]

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019). arXiv:1910.01108 http://arxiv.org/abs/1910.01108

[38]

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 11 (1997), 2673–2681. https://doi.org/10.1109/78.650093

[39]

Naeem Siddiqi. 2005. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
