A High-Performance Algorithm for Identifying Frequent Items in Data Streams

arXiv: 1705.07001 · v2 · pith:XET3RJIK · submitted 2017-05-19 · 💻 cs.DS

Daniel Anderson, Pryce Bevan, Kevin Lang, Edo Liberty, Lee Rhodes, Justin Thaler

classification 💻 cs.DS

keywords: algorithm, data, prior, streams, common, describe, gries, improves


Abstract

Estimating frequencies of items over data streams is a common building block in streaming data measurement and analysis. Misra and Gries introduced their seminal algorithm for the problem in 1982, and the problem has since been revisited many times due to its practicality and applicability. We describe a highly optimized version of Misra and Gries' algorithm that is suitable for deployment in industrial settings. Our code is made public via an open source library called DataSketches that is already used by several companies and production systems. Our algorithm improves on two theoretical and practical aspects of prior work. First, it handles weighted updates in amortized constant time, a common requirement in practice. Second, it uses a simple and fast method for merging summaries that asymptotically improves on prior work even for unweighted streams. We describe experiments confirming that our algorithms are more efficient than prior proposals.
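
For readers unfamiliar with the underlying technique, the following is a minimal sketch of a textbook-style Misra-Gries summary with weighted updates and merging. It is not the optimized algorithm of the paper and not the DataSketches API: the class name `MisraGries`, its methods, and the `offset` field are hypothetical names chosen for illustration. This simple variant sorts the counters on every overflow, so it does not achieve the amortized constant update time claimed in the abstract.

```python
class MisraGries:
    """Illustrative Misra-Gries summary holding at most k counters.

    Hypothetical teaching sketch; not the paper's optimized algorithm.
    """

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.counters = {}     # item -> estimated weight (a lower bound)
        self.total_weight = 0  # sum of all weights fed into the summary
        self.offset = 0        # total amount subtracted so far (error bound)

    def update(self, item, weight=1):
        """Add `weight` occurrences of `item` to the summary."""
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.total_weight += weight
        self.counters[item] = self.counters.get(item, 0) + weight
        if len(self.counters) > self.k:
            self._shrink()

    def _shrink(self):
        """Subtract the (k+1)-th largest counter from every counter and
        drop the counters that are no longer positive, so that at most
        k counters survive."""
        values = sorted(self.counters.values(), reverse=True)
        decrement = values[self.k]
        self.offset += decrement
        self.counters = {
            item: count - decrement
            for item, count in self.counters.items()
            if count > decrement
        }

    def estimate(self, item):
        """Return a lower bound on the true weight of `item`; the true
        weight is at most estimate(item) + offset."""
        return self.counters.get(item, 0)

    def merge(self, other):
        """Fold another summary into this one: add counters item by item,
        then shrink once if more than k counters remain."""
        for item, count in other.counters.items():
            self.counters[item] = self.counters.get(item, 0) + count
        self.total_weight += other.total_weight
        self.offset += other.offset
        if len(self.counters) > self.k:
            self._shrink()
```

In this sketch every shrink removes the same amount from at least k + 1 counters, so `offset` never exceeds `total_weight / (k + 1)`, and each estimate undercounts the true weight by at most `offset`. The same argument carries over to `merge`, because the offsets and total weights of the two summaries simply add.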
