pith. sign in

arxiv: 1706.01172 · v1 · pith:ZBLULL6Snew · submitted 2017-06-05 · 💻 cs.DS

Improved Consistent Weighted Sampling Revisited

classification 💻 cs.DS
keywords icwssetsalgorithmimprovedweightedcomplexityconsistentdata
0
0 comments X
read the original abstract

Min-Hash is a popular technique for efficiently estimating the Jaccard similarity of binary sets. Consistent Weighted Sampling (CWS) generalizes the Min-Hash scheme to sketch weighted sets and has drawn increasing interest from the community. Due to its constant-time complexity independent of the values of the weights, Improved CWS (ICWS) is considered as the state-of-the-art CWS algorithm. In this paper, we revisit ICWS and analyze its underlying mechanism to show that there actually exists dependence between the two components of the hash-code produced by ICWS, which violates the condition of independence. To remedy the problem, we propose an Improved ICWS (I$^2$CWS) algorithm which not only shares the same theoretical computational complexity as ICWS but also abides by the required conditions of the CWS scheme. The experimental results on a number of synthetic data sets and real-world text data sets demonstrate that our I$^2$CWS algorithm can estimate the Jaccard similarity more accurately, and also compete with or outperform the compared methods, including ICWS, in classification and top-$K$ retrieval, after relieving the underlying dependence.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.