pith. sign in

arxiv: 1502.06823 · v1 · pith:VYPTAQ6Jnew · submitted 2015-02-24 · 💻 cs.DB

CrowdGather: Entity Extraction over Structured Domains

classification 💻 cs.DB
keywords extractionentityqueriesdomainsanotherconstructiondatadomain
0
0 comments X p. Extension
pith:VYPTAQ6J Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{VYPTAQ6J}

Prints a linked pith:VYPTAQ6J badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

read the original abstract

Crowdsourced entity extraction is often used to acquire data for many applications, including recommendation systems, construction of aggregated listings and directories, and knowledge base construction. Current solutions focus on entity extraction using a single query, e.g., only using "give me another restaurant", when assembling a list of all restaurants. Due to the cost of human labor, solutions that focus on a single query can be highly impractical. In this paper, we leverage the fact that entity extraction often focuses on {\em structured domains}, i.e., domains that are described by a collection of attributes, each potentially exhibiting hierarchical structure. Given such a domain, we enable a richer space of queries, e.g., "give me another Moroccan restaurant in Manhattan that does takeout". Naturally, enabling a richer space of queries comes with a host of issues, especially since many queries return empty answers. We develop new statistical tools that enable us to reason about the gain of issuing {\em additional queries} given little to no information, and show how we can exploit the overlaps across the results of queries for different points of the data domain to obtain accurate estimates of the gain. We cast the problem of {\em budgeted entity extraction} over large domains as an adaptive optimization problem that seeks to maximize the number of extracted entities, while minimizing the overall extraction costs. We evaluate our techniques with experiments on both synthetic and real-world datasets, demonstrating a yield of up to 4X over competing approaches for the same budget.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.