Graph-based Active Learning for Entity Cluster Repair

Daniel Obraczka; Erhard Rahm; Martin Franke; Marvin Hofer; Victor Christen

arxiv: 2401.14992 · v1 · submitted 2024-01-26 · 💻 cs.LG · cs.DB

Victor Christen, Daniel Obraczka, Marvin Hofer, Martin Franke, Erhard Rahm

Pith reviewed 2026-05-24 04:47 UTC · model grok-4.3

keywords: graph-based active learning · entity cluster repair · similarity graphs · active learning · entity resolution · cluster repair · data integration · duplicate detection

The pith

A classifier trained on graph metrics from similarity graphs, combined with cluster-specific active learning, repairs entity clusters more accurately than existing methods on both duplicate-free and dirty data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a graph-based method for repairing clusters of entity records. It computes metrics on the similarity graph to train a model that identifies which edges are incorrect. An active learning approach tailored to individual clusters helps gather the necessary labels efficiently. This allows the method to work without assuming the data sources are duplicate-free, unlike many prior techniques. The results indicate improved performance, especially when duplicates are present in the data.
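
As a rough illustration of the repair step described above, the following sketch drops every edge that a classifier has flagged as incorrect and treats the remaining connected components as the repaired clusters. The function name `repair_clusters`, the `is_incorrect` callback, and the toy records are hypothetical; the paper describes an iterative procedure (Figure 3) that may differ in detail.

```python
def repair_clusters(nodes, edges, is_incorrect):
    """Drop edges flagged as incorrect; return connected components as clusters."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        if not is_incorrect((u, v)):
            parent[find(u)] = find(v)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())


# Toy cluster (assumed data): a chain a-b-c-d where the edge c-d is flagged.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
flagged = {("c", "d")}
repaired = repair_clusters(nodes, edges, lambda e: e in flagged)
```

In this toy case, removing the single flagged edge splits the original cluster into two, with record `d` ending up in a cluster of its own.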

Core claim

The central discovery is that graph metrics derived from the underlying similarity graphs can be used to construct a classification model that distinguishes between correct and incorrect edges in entity clusters. By integrating this with an active learning mechanism that is tailored to cluster-specific attributes, the approach addresses the scarcity of labeled training data. This enables effective cluster repair that does not require the assumption of duplicate-free data sources and shows enhanced performance on datasets containing duplicates.

What carries the argument

Graph metrics from similarity graphs used to train a classifier for identifying erroneous edges, paired with a cluster-specific active learning strategy to select informative training examples.
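
To make the idea of edge-level graph metrics concrete, here is a minimal sketch that turns each edge of a small similarity graph into a feature vector. The metrics chosen (similarity weight, minimum endpoint degree, common neighbours, neighbourhood Jaccard) are illustrative assumptions, not necessarily the ones used in the paper, and `edge_features` is a hypothetical helper.

```python
from collections import defaultdict


def edge_features(edges):
    """edges maps (u, v) to a similarity in [0, 1]; returns one feature dict per edge."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    features = {}
    for (u, v), sim in edges.items():
        # Neighbourhoods of both endpoints, excluding the edge itself.
        nu = neighbors[u] - {v}
        nv = neighbors[v] - {u}
        shared = nu & nv
        union = nu | nv
        features[(u, v)] = {
            "similarity": sim,
            "min_degree": min(len(neighbors[u]), len(neighbors[v])),
            "common_neighbors": len(shared),
            "neighbor_jaccard": len(shared) / len(union) if union else 0.0,
        }
    return features


# Toy cluster (assumed data): a, b, c form a triangle; d hangs off c by a weak link.
toy_edges = {("a", "b"): 0.9, ("a", "c"): 0.85, ("b", "c"): 0.8, ("c", "d"): 0.55}
feats = edge_features(toy_edges)
```

Feature dictionaries of this kind could then be fed to any standard classifier. In the toy graph the weakly attached edge `("c", "d")` has no common neighbours, which is the sort of structural signal such a classifier might pick up.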

If this is right

The method can be applied without modification to both duplicate-free data sources and sources that contain duplicates.
The modified active learning strategy improves results specifically when duplicates are present.
Existing cluster repair methods can be outperformed by leveraging graph-based classification.
Cluster repair quality becomes less dependent on the configuration and dataset characteristics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the graph metrics prove robust across domains, this approach could simplify entity resolution pipelines by reducing the need for separate duplicate detection steps.
Similar techniques using graph metrics for edge classification might extend to other tasks like link prediction in knowledge graphs.
Further work could test whether the active learning reduces labeling effort by a measurable factor in large datasets.

Load-bearing premise

Graph metrics on the similarity graph provide enough information for a classifier to accurately separate correct from incorrect edges, and the active learning can sufficiently compensate for few initial labels.
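
The label-scarcity half of this premise can be pictured with a generic uncertainty-sampling step. The sketch below is not the paper's cluster-specific strategy, whose details are not given here; it is a stand-in that picks the edges whose predicted probability is closest to 0.5 while capping how many queries come from any one cluster. The names `select_queries` and `per_cluster_cap` and all numbers are assumptions.

```python
def select_queries(scores, cluster_of, budget, per_cluster_cap=1):
    """Choose up to `budget` edges with predicted probability nearest 0.5,
    taking at most `per_cluster_cap` edges from any single cluster."""
    # Most uncertain first; the edge itself breaks ties deterministically.
    ranked = sorted(scores, key=lambda e: (abs(scores[e] - 0.5), e))
    taken = {}
    chosen = []
    for edge in ranked:
        if len(chosen) >= budget:
            break
        cid = cluster_of[edge]
        if taken.get(cid, 0) >= per_cluster_cap:
            continue
        chosen.append(edge)
        taken[cid] = taken.get(cid, 0) + 1
    return chosen


# Assumed classifier outputs: probability that each edge is correct.
scores = {
    ("a", "b"): 0.95, ("a", "c"): 0.52, ("b", "c"): 0.45,
    ("x", "y"): 0.30, ("y", "z"): 0.90,
}
cluster_of = {
    ("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1,
    ("x", "y"): 2, ("y", "z"): 2,
}
queries = select_queries(scores, cluster_of, budget=2)
```

With a cap of one query per cluster, the second most uncertain edge `("b", "c")` is skipped in favour of an edge from the other cluster, so the labels requested are spread across clusters.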

What would settle it

Running the classifier on a dataset where edge correctness is known but the graph metrics show no statistical difference between correct and incorrect edges would disprove the utility of the approach.
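
One way to check for a "statistical difference" of this kind is a permutation test on a single graph metric. The sketch below is only one possible check, with hypothetical function names and made-up metric values; a real study would repeat it per metric and correct for multiple comparisons.

```python
import random


def permutation_p_value(correct, incorrect, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means of one graph metric."""
    rng = random.Random(seed)
    observed = abs(sum(correct) / len(correct) - sum(incorrect) / len(incorrect))
    pooled = list(correct) + list(incorrect)
    n = len(correct)
    hits = 0
    for _ in range(n_permutations):
        # Reassign edge labels at random and recompute the mean difference.
        rng.shuffle(pooled)
        a, b = pooled[:n], pooled[n:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Assumed metric values for edges with known correctness.
p_separated = permutation_p_value(
    [0.90, 0.80, 0.85, 0.95, 0.90, 0.88],
    [0.10, 0.20, 0.15, 0.05, 0.12, 0.18],
)
p_identical = permutation_p_value([0.5, 0.6, 0.7, 0.8], [0.5, 0.6, 0.7, 0.8])
```

A small p-value for a metric suggests it separates correct from incorrect edges. If every metric behaved like the `p_identical` case on real labelled data, the load-bearing premise would be in doubt.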

Figures

Figures reproduced from arXiv: 2401.14992 by Daniel Obraczka, Erhard Rahm, Martin Franke, Marvin Hofer, Victor Christen.

**Figure 2.** Overview of the graph-based cluster repair method. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

**Figure 3.** Example of the iterative cluster repair procedure showing 6 records of an… [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

**Figure 4.** Results on MusicBrainz and Dexter (C0, C50, C100) datasets with dif… [PITH_FULL_IMAGE:figures/full_fig_p011_4.png]

**Figure 5.** F1-score results of our proposed approach (GraphCR) as compared with… [PITH_FULL_IMAGE:figures/full_fig_p012_5.png]

**Figure 6.** Decision matrix comparing cluster repair approaches using Bayesian… [PITH_FULL_IMAGE:figures/full_fig_p013_6.png]

**Figure 7.** Results on Dexter and MusicBrainz datasets with various error ratios of… [PITH_FULL_IMAGE:figures/full_fig_p014_7.png]

Original abstract

Cluster repair methods aim to determine errors in clusters and modify them so that each cluster consists of records representing the same entity. Current cluster repair methodologies primarily assume duplicate-free data sources, where each record from one source corresponds to a unique record from another. However, real-world data often deviates from this assumption due to quality issues. Recent approaches apply clustering methods in combination with link categorization methods so they can be applied to data sources with duplicates. Nevertheless, the results do not show a clear picture since the quality highly varies depending on the configuration and dataset. In this study, we introduce a novel approach for cluster repair that utilizes graph metrics derived from the underlying similarity graphs. These metrics are pivotal in constructing a classification model to distinguish between correct and incorrect edges. To address the challenge of limited training data, we integrate an active learning mechanism tailored to cluster-specific attributes. The evaluation shows that the method outperforms existing cluster repair methods without distinguishing between duplicate-free or dirty data sources. Notably, our modified active learning strategy exhibits enhanced performance when dealing with datasets containing duplicates, showcasing its effectiveness in such scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper combines graph metrics on similarity graphs with cluster-specific active learning to classify edges in entity clusters and claims it works on both clean and duplicate-containing sources, but the abstract supplies no numbers or setup details to check the outperformance claim.

The core contribution is a pipeline that extracts graph metrics from the similarity graph to train a classifier distinguishing correct from incorrect edges, then applies active learning tuned to cluster attributes to handle scarce labels. This is positioned as an improvement over prior cluster repair methods that either assumed duplicate-free sources or relied on generic clustering plus link categorization whose results varied a lot by setup and data. The paper does a clear job of noting that real sources often contain internal duplicates and that the active learning modification is meant to help more in those cases. That framing is useful for anyone who has run into the duplicate-free assumption in entity resolution work.

The approach itself looks like a straightforward supervised classification step plus a targeted sampling strategy, with no equations or derivations that introduce circularity or hidden fitting. The main limitation is that the abstract asserts outperformance without any quantitative results, dataset names, baseline descriptions, or test details, so the central empirical claim cannot be evaluated from the given text. If the full paper contains properly reported experiments with reproducible baselines and clear gains, that would address the gap; without them the soundness stays thin.

This is aimed at researchers working on entity resolution and data integration pipelines that must handle imperfect sources. A reader already familiar with cluster repair might pick up the graph-feature idea or the active learning tweak for their own setups. I would send it to peer review if the experiments are solid and reported in full, because the underlying problem is practical and the direction is reasonable even if the results need close checking.

Referee Report

2 major / 1 minor

Summary. The paper presents a graph-based active learning method for entity cluster repair. Graph metrics computed on similarity graphs are used to train a supervised classifier that labels edges as correct or incorrect. A cluster-specific active learning strategy is added to mitigate label scarcity. The central claim is that this approach outperforms prior cluster repair methods uniformly on both duplicate-free and dirty data sources, with the modified active learning providing particular gains on duplicate-containing datasets.

Significance. If the empirical claims are substantiated, the work would address a practical gap in entity resolution by removing the common duplicate-free source assumption and offering a label-efficient alternative via graph features and tailored active learning. This could improve robustness in real-world dirty-data settings.

major comments (2)

[Abstract] The claim that 'the evaluation shows that the method outperforms existing cluster repair methods' and that the modified active learning 'exhibits enhanced performance' on duplicates is asserted without any quantitative results, dataset descriptions, baseline implementations, or statistical tests. This prevents verification of the central empirical claim.
[Evaluation] The assumption that graph metrics on the similarity graph are sufficiently discriminative to train an accurate edge classifier, and that the cluster-specific active learning overcomes label scarcity, is load-bearing for the outperformance claim yet lacks supporting experimental detail, baseline comparisons, or ablation results in the provided text.

minor comments (1)

[Abstract] The abstract phrasing 'outperforms existing cluster repair methods without distinguishing between duplicate-free or dirty data sources' is ambiguous and should be clarified to indicate whether the method requires no prior knowledge of data type or simply achieves comparable results across types.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and evaluation section. We agree that the central empirical claims require more concrete quantitative support and experimental detail to be verifiable. We will revise the manuscript accordingly.

Referee: [Abstract] The claim that 'the evaluation shows that the method outperforms existing cluster repair methods' and that the modified active learning 'exhibits enhanced performance' on duplicates is asserted without any quantitative results, dataset descriptions, baseline implementations, or statistical tests. This prevents verification of the central empirical claim.

Authors: We accept the point. The current abstract states the outperformance claims at a high level without numbers or specifics. In the revised manuscript we will update the abstract to include key quantitative results (e.g., F1 or accuracy deltas versus baselines on the evaluated datasets), brief dataset descriptions, and references to statistical tests. This will make the central claim directly verifiable.

Revision: yes

Referee: [Evaluation] The assumption that graph metrics on the similarity graph are sufficiently discriminative to train an accurate edge classifier, and that the cluster-specific active learning overcomes label scarcity, is load-bearing for the outperformance claim yet lacks supporting experimental detail, baseline comparisons, or ablation results in the provided text.

Authors: We agree that the evaluation section needs expansion to substantiate the load-bearing assumptions. While the manuscript describes the graph metrics and cluster-specific active learning, we will add (i) ablation results showing the discriminative power of the graph metrics for the edge classifier, (ii) explicit baseline comparisons with prior cluster repair methods, and (iii) experiments demonstrating how the active learning strategy mitigates label scarcity, with particular attention to gains on duplicate-containing datasets. These additions will directly support the outperformance claims.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a graph-metric classifier plus cluster-specific active learning for entity cluster repair. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or described structure. The central claims rest on standard supervised learning assumptions rather than any derivation that reduces to its own inputs by construction. This is the normal non-circular outcome for an applied ML methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method is described as relying on standard graph metrics and off-the-shelf classification plus active-learning techniques.
