pith. sign in

arxiv: 1411.7191 · v3 · pith:5ORUJSF4new · submitted 2014-11-26 · 💻 cs.DS

Hashing for statistics over k-partitions

classification 💻 cs.DS
keywords hashfunctiontabulationbinsboundsconcentrationflajolethashing
0
0 comments X
read the original abstract

In this paper we analyze a hash function for $k$-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin~[FOCS'83] in order to save a factor $\Omega(k)$ of time per element over $k$ independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al.~[AOFA'97] and in large-scale machine learning by Li et al.~[NIPS'12] for minwise estimation of set similarity. The main issue of $k$-partition, is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of $k$-partitioning similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g. a simple and efficient construction for invertible bloom filters and uniform hashing on a given set.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.