
arxiv: 1809.11084 · v1 · pith:EAES6WT7new · submitted 2018-09-28 · cs.DB · cs.LG · stat.ML

Reuse and Adaptation for Entity Resolution through Transfer Learning

classification: cs.DB · cs.LG · stat.ML
keywords: data, training, algorithms, dataset, datasets, entity, experiments, feature

Entity resolution (ER) is one of the fundamental problems in data integration, where machine learning (ML) based classifiers often provide state-of-the-art results. Considerable human effort goes into feature engineering and training data creation. In this paper, we investigate a new problem: given a dataset D_T for ER with limited or no training data, is it possible to train a good ML classifier on D_T by reusing and adapting the training data of a dataset D_S from the same or a related domain? Our major contributions include (1) a distributed-representation-based approach to encode each tuple from diverse datasets into a standard feature space; (2) identification of common scenarios where the reuse of training data can be beneficial; and (3) five algorithms for handling the aforementioned scenarios. We have performed comprehensive experiments on 12 datasets from 5 different domains (publications, movies, songs, restaurants, and books). Our experiments show that our algorithms provide significant benefits, such as superior performance for a fixed training data size.
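The first contribution, encoding tuples from differently shaped datasets into one standard feature space, can be illustrated with a minimal sketch. This is not the paper's method. It assumes a hash-based stand-in for pretrained word embeddings, and the names `token_vector`, `encode_tuple`, `pair_features`, and `DIM` are hypothetical. The point is only that a schema-agnostic encoding gives candidate pairs from any dataset a feature vector of the same length.

```python
import hashlib

DIM = 16  # hypothetical embedding size (at most 32, the SHA-256 digest length)


def token_vector(token, dim=DIM):
    """Deterministic stand-in for a pretrained word embedding (an assumption)."""
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    # Map each of the first `dim` bytes from [0, 255] to [-1, 1].
    return [(b / 127.5) - 1.0 for b in digest[:dim]]


def encode_tuple(record, dim=DIM):
    """Average the token vectors of all attribute values, ignoring the schema."""
    tokens = [t for value in record.values() for t in str(value).split()]
    if not tokens:
        return [0.0] * dim
    vecs = [token_vector(t, dim) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def pair_features(left, right, dim=DIM):
    """Fixed-length features for a candidate pair: elementwise absolute difference."""
    a, b = encode_tuple(left, dim), encode_tuple(right, dim)
    return [abs(x - y) for x, y in zip(a, b)]


# Two datasets with different schemas (illustrative records, not from the paper).
pub_a = {"title": "Deep Learning for ER", "venue": "VLDB", "year": 2018}
pub_b = {"title": "deep learning for ER", "venue": "vldb", "year": 2018}
rest_a = {"name": "Blue Door Cafe", "city": "Austin"}
rest_b = {"name": "Red Lantern", "city": "Boston"}

pub_feats = pair_features(pub_a, pub_b)
rest_feats = pair_features(rest_a, rest_b)
```

Because `pub_feats` and `rest_feats` have the same length, a classifier trained on labeled pairs from a source dataset D_S can at least be applied to pairs from a target dataset D_T. Whether it performs well there is the question the paper's scenarios and algorithms address.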
