CrowdGather: Entity Extraction over Structured Domains

read the original abstract

Crowdsourced entity extraction is often used to acquire data for many applications, including recommendation systems, construction of aggregated listings and directories, and knowledge base construction. Current solutions focus on entity extraction using a single query, e.g., only using "give me another restaurant", when assembling a list of all restaurants. Due to the cost of human labor, solutions that focus on a single query can be highly impractical. In this paper, we leverage the fact that entity extraction often focuses on {\em structured domains}, i.e., domains that are described by a collection of attributes, each potentially exhibiting hierarchical structure. Given such a domain, we enable a richer space of queries, e.g., "give me another Moroccan restaurant in Manhattan that does takeout". Naturally, enabling a richer space of queries comes with a host of issues, especially since many queries return empty answers. We develop new statistical tools that enable us to reason about the gain of issuing {\em additional queries} given little to no information, and show how we can exploit the overlaps across the results of queries for different points of the data domain to obtain accurate estimates of the gain. We cast the problem of {\em budgeted entity extraction} over large domains as an adaptive optimization problem that seeks to maximize the number of extracted entities, while minimizing the overall extraction costs. We evaluate our techniques with experiments on both synthetic and real-world datasets, demonstrating a yield of up to 4X over competing approaches for the same budget.

CrowdGather: Entity Extraction over Structured Domains

discussion (0)