WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction

Bhavana Dalvi; Jamie Callan; William W. Cohen

arxiv: 1307.0261 · v1 · pith:CEVZMGLWnew · submitted 2013-07-01 · cs.LG · cs.CL · cs.IR

keywords: method, concept-instance, pairs, clusters, corpus, extracting, extraction, hearst

We describe an open-domain information extraction method for extracting concept-instance pairs from an HTML corpus. Most earlier approaches to this problem rely on combining clusters of distributionally similar terms and concept-instance pairs obtained with Hearst patterns. In contrast, our method relies on a novel approach for clustering terms found in HTML tables, and then assigning concept names to these clusters using Hearst patterns. The method can be efficiently applied to a large corpus, and experimental results on several datasets show that our method can accurately extract large numbers of concept-instance pairs.
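
The two-stage idea in the abstract (cluster terms from HTML table columns, then name each cluster with Hearst patterns) can be illustrated with a toy sketch. This is not the paper's algorithm: a greedy overlap merge stands in for the authors' clustering approach, only the single pattern "X such as A, B and C" is handled, and the inputs, function names, and the `min_overlap` threshold are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy inputs: table columns already extracted from HTML pages,
# plus a few fragments of free text from the same corpus.
columns = [
    ["India", "China", "Canada"],
    ["China", "Canada", "France"],
    ["Pittsburgh", "Boston", "Seattle"],
    ["Boston", "Seattle", "Chicago"],
]
sentences = [
    "countries such as India, China and France",
    "cities such as Boston, Chicago and Seattle",
    "places such as Boston and Canada",
]

def cluster_columns(columns, min_overlap=2):
    """Greedily merge columns that share at least min_overlap terms.

    Assumption: a stand-in for the paper's table-based clustering.
    """
    clusters = []
    for col in columns:
        terms = set(col)
        for cluster in clusters:
            if len(cluster & terms) >= min_overlap:
                cluster |= terms  # merge into the first matching cluster
                break
        else:
            clusters.append(terms)  # no match: start a new cluster
    return clusters

# One Hearst pattern only: "<concept> such as A, B and C" (single-word terms).
HEARST = re.compile(r"(\w+) such as ((?:\w+, )*\w+(?: and \w+)?)")

def hearst_pairs(sentences):
    """Yield (concept, instance) pairs matched by the pattern above."""
    for sentence in sentences:
        for concept, tail in HEARST.findall(sentence):
            for instance in re.split(r", | and ", tail):
                yield concept, instance

def label_clusters(clusters, pairs):
    """Name each cluster by the concept that covers most of its terms."""
    pairs = list(pairs)
    labels = []
    for cluster in clusters:
        votes = Counter(c for c, inst in pairs if inst in cluster)
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels

clusters = cluster_columns(columns)
labels = label_clusters(clusters, hearst_pairs(sentences))
for label, cluster in zip(labels, clusters):
    print(label, sorted(cluster))
```

In this sketch the noisy pair from "places such as Boston and Canada" is outvoted inside each cluster, which hints at why naming a whole cluster can be more robust than trusting individual pattern matches.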
