arxiv: 1708.05918 · v3 · pith:LFQ4RG7C · submitted 2017-08-20 · 💻 cs.DB

Adaptive Sampling for Rapidly Matching Histograms

Stephen Macke, Yiming Zhang, Silu Huang, Aditya Parameswaran

classification 💻 cs.DB

keywords histograms, distribution, fastmatch, sampling, histsim, large, approach, histogram

abstract

In exploratory data analysis, analysts often have a need to identify histograms that possess a specific distribution, among a large class of candidate histograms, e.g., find countries whose income distribution is most similar to that of Greece. This distribution could be a new one that the user is curious about, or a known distribution from an existing histogram visualization. At present, this process of identification is brute-force, requiring the manual generation and evaluation of a large number of histograms. We present FastMatch: an end-to-end approach for interactively retrieving the histogram visualizations most similar to a user-specified target, from a large collection of histograms. The primary technical contribution underlying FastMatch is a probabilistic algorithm, HistSim, a theoretically sound sampling-based approach to identify the top-$k$ closest histograms under $\ell_1$ distance. While HistSim can be used independently, within FastMatch we couple HistSim with a novel system architecture that is aware of practical considerations, employing asynchronous block-based sampling policies, building on lightweight sampling engines developed in recent work. FastMatch obtains near-perfect accuracy with up to $35\times$ speedup over approaches that do not use sampling on several real-world datasets.
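The matching task described in the abstract can be illustrated with a small sketch. This is not HistSim or FastMatch: it uses a single uniform row sample with no adaptive stopping and no statistical guarantee, and every name in it (`top_k_by_sampling`, `sample_fraction`, the `(candidate, bin_index)` row layout) is an assumption made for illustration. It only shows what "top-$k$ closest histograms under $\ell_1$ distance" means when candidate histograms are estimated from a sample.

```python
import heapq
import random
from collections import Counter, defaultdict


def l1_distance(p, q):
    """l1 distance between two equal-length normalized histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))


def normalize(counts, num_bins):
    """Turn per-bin counts into a list of frequencies that sums to 1."""
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_bins
    return [counts.get(b, 0) / total for b in range(num_bins)]


def top_k_by_sampling(rows, target, k, sample_fraction=1.0, seed=0):
    """Estimate each candidate's histogram from a uniform row sample and
    return the k candidates closest to `target` as (distance, candidate).

    rows: iterable of (candidate, bin_index) pairs (hypothetical layout).
    target: normalized target histogram, one frequency per bin.
    Candidates with no sampled rows are omitted from the result.
    """
    rng = random.Random(seed)
    num_bins = len(target)
    counts = defaultdict(Counter)
    for candidate, bin_index in rows:
        # Keep each row independently with probability sample_fraction.
        if rng.random() < sample_fraction:
            counts[candidate][bin_index] += 1
    scored = [
        (l1_distance(normalize(c, num_bins), target), candidate)
        for candidate, c in counts.items()
    ]
    return heapq.nsmallest(k, scored)
```

With `sample_fraction=1.0` the function reduces to the exhaustive baseline that evaluates every row; lowering it trades accuracy for speed, which is the trade-off the paper's algorithm is designed to control with guarantees that this sketch does not provide.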
