
arxiv: 1805.12319 · v3 · submitted 2018-05-31 · 💻 cs.DB

Recognition: unknown

Skyblocking for Entity Resolution

classification: cs.DB
keywords: scheme, blocking, skyline, learning, schemes, skylines, space, approach

In this paper, for the first time, we introduce the concept of skyblocking, which aims to efficiently identify the "most preferred" blocking schemes with respect to a given set of selection criteria for entity resolution blocking. To capture all possible preferred blocking schemes, we study the scheme skyline (i.e., the blocking schemes on the skyline) in a multi-dimensional scheme space whose dimensions correspond to the selection criteria for blocking (e.g., pair completeness, PC, and pair quality, PQ). However, applying traditional skyline techniques to learn scheme skylines is a non-trivial task. Due to the unique characteristics of blocking schemes, we face several challenges, such as how to find a balanced number of match and non-match labels to effectively approximate a blocking scheme in a scheme space, and how to design efficient skyline algorithms that explore a scheme space to find scheme skylines. To overcome these challenges, we propose a scheme skyline learning approach, which incorporates skyline techniques into an active learning process for scheme skylines. We have conducted experiments over four real-world datasets. The experimental results show that our approach is able to efficiently identify scheme skylines in a large scheme space using only a limited number of labels. Our approach also outperforms the state-of-the-art approaches for learning blocking schemes in several aspects, including label efficiency, blocking quality, and learning efficiency.
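To make the notion of a scheme skyline concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the standard definitions of pair completeness (PC: the fraction of true matches that share a block) and pair quality (PQ: the fraction of blocked pairs that are true matches), and it uses a naive pairwise dominance check rather than the efficient skyline algorithms or active learning the abstract refers to. The function names, the scheme names, and the scores are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# A labeled pair: (placed_in_same_block, is_true_match). Hypothetical format.
LabeledPair = Tuple[bool, bool]


def pc_pq(pairs: Sequence[LabeledPair]) -> Tuple[float, float]:
    """Estimate (PC, PQ) of one blocking scheme from labeled pairs."""
    matches = sum(1 for _, is_match in pairs if is_match)
    blocked = sum(1 for in_block, _ in pairs if in_block)
    blocked_matches = sum(1 for in_block, is_match in pairs if in_block and is_match)
    pc = blocked_matches / matches if matches else 0.0
    pq = blocked_matches / blocked if blocked else 0.0
    return pc, pq


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if a is no worse in every dimension and better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def scheme_skyline(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of schemes not dominated by any other scheme."""
    return sorted(
        name
        for name, score in scores.items()
        if not any(
            dominates(other_score, score)
            for other, other_score in scores.items()
            if other != name
        )
    )


# Hypothetical (PC, PQ) scores for four made-up candidate schemes.
candidates = {
    "same_title": (0.60, 0.90),
    "same_author": (0.85, 0.40),
    "same_title_or_author": (0.95, 0.30),
    "same_year": (0.80, 0.35),  # worse than same_author on both criteria
}

skyline = scheme_skyline(candidates)
print(skyline)
```

In this toy scheme space, `same_year` is dropped because `same_author` is better on both criteria, while the other three schemes each represent a different PC/PQ trade-off and so remain on the skyline. The naive check compares every pair of schemes, which is quadratic in the number of schemes.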

