Algorithms for Similarity Search and Pseudorandomness

Tobias Christiani

arxiv: 1906.09430 · v1 · pith:ZRCSB2HUnew · submitted 2019-06-22 · 💻 cs.DS

Algorithms for Similarity Search and Pseudorandomness

Tobias Christiani This is my paper

Pith reviewed 2026-05-25 18:11 UTC · model grok-4.3

classification 💻 cs.DS

keywords approximate near neighbor searchlocality-sensitive hashingspace-time tradeoffspseudorandom numberssimilarity joinslower boundshash functions

0 comments

The pith

An improved locality-sensitive hashing framework solves approximate near neighbor search with fewer hash evaluations and lower word-RAM complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an improved framework for the approximate near neighbor problem that uses locality-sensitive hashing. This framework reduces the number of hash function evaluations and the word-RAM complexity relative to the standard framework. It also supplies frameworks for space-time tradeoffs with tight upper and lower bounds under cosine similarity and a novel method for search over sets that comes with a matching lower bound. Separate constructions give deterministic algorithms for high-quality pseudorandom numbers in constant time with near-optimal space and randomized hash families that nearly match cell-probe lower bounds. These advances matter because they make similarity search in large data sets more efficient and strengthen the guarantees available for pseudorandom generation.

Core claim

The author claims that an improved locality-sensitive hashing framework reduces the number of evaluations of locality-sensitive hash functions and the word-RAM complexity for solving the approximate near neighbor problem, provides space-time tradeoffs with tight upper and lower bounds for cosine similarity, offers a novel approach for the problem on sets with a matching lower bound, shows optimality of a Boolean scheme, gives tight lower bounds for asymmetric hashing, and supplies deterministic and randomized methods for high-quality pseudorandom number generation with near-optimal space and time.

What carries the argument

The improved framework for the approximate near neighbor problem that uses locality-sensitive hashing to reduce hash evaluations and word-RAM complexity.

If this is right

The number of locality-sensitive hash function evaluations drops compared to the standard framework.
Word-RAM complexity improves for approximate near neighbor solutions.
Tight upper and lower bounds characterize space-time tradeoffs under cosine similarity.
A novel approach improves performance for search on sets while matching a lower bound.
Deterministic algorithms generate high-quality pseudorandom numbers in constant time with near-optimal space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practical retrieval systems could adopt the self-tuning variant to improve performance without manual tuning of parameters.
The matching lower bounds suggest that further asymptotic gains for set-based search are unlikely under the stated model.
The pseudorandom constructions could strengthen randomized algorithms that need high-quality bits while staying within tight space limits.

Load-bearing premise

Locality-sensitive hash families exist that preserve the required similarities with the probabilities stated in the standard definitions.

What would settle it

A data distribution or similarity measure where the new framework requires the same number of hash evaluations as the standard framework would show that the claimed reduction does not hold.

Figures

Figures reproduced from arXiv: 1906.09430 by Tobias Christiani.

**Figure 1.1.** Figure 1.1: The gap in the ρ-value between the best known upper and lower bounds for families of ((1 − α)d/2,(1 − β)d/2, p1, p2)-sensitive hash functions in d-dimensional Hamming space. 1.1.4 Beyond locality-sensitive hashing A common theme among recent advances in the area of theoretical approximate similarity search has been to move beyond standard localitysensitive hashing [14, 100, 149, 24, 16, 54, 56, 22]. The… view at source ↗

**Figure 2.1.** Figure 2.1: Upper bounds on the number of locality-sensitive hash functions from a [PITH_FULL_IMAGE:figures/full_fig_p036_2_1.png] view at source ↗

**Figure 2.** Figure 2: reveals that the number of hash functions used by the Indyk [PITH_FULL_IMAGE:figures/full_fig_p036_2.png] view at source ↗

**Figure 2.2.** Figure 2.2: The number of locality-sensitive hash functions from a [PITH_FULL_IMAGE:figures/full_fig_p047_2_2.png] view at source ↗

**Figure 2.** Figure 2: where we have set [PITH_FULL_IMAGE:figures/full_fig_p047_2.png] view at source ↗

**Figure 2.3.** Figure 2.3: The number of locality-sensitive hash functions from a [PITH_FULL_IMAGE:figures/full_fig_p048_2_3.png] view at source ↗

**Figure 4.1.** Figure 4.1: Chosen Path uses a branching process to associate each vector [PITH_FULL_IMAGE:figures/full_fig_p093_4_1.png] view at source ↗

**Figure 4.2.** Figure 4.2: Exponent when searching for a vector with Jaccard similarity [PITH_FULL_IMAGE:figures/full_fig_p098_4_2.png] view at source ↗

**Figure 4.** Figure 4: ). The details of the comparisons are given in Appendix 4.8. [PITH_FULL_IMAGE:figures/full_fig_p107_4.png] view at source ↗

**Figure 4.3.** Figure 4.3: The difference ρ − ρminhash comparing Chosen Path and MinHash in terms of Braun-Blanquet similarities 0 < b2 < b1 < 1. of Theorem 4.1. In Appendix 4.8 we show that ρ < ρdatadep whenever b2 ≤ 1/5, or equivalently, whenever j2 ≤ 1/9. Revisiting the above example, for j1 = 0.2 and j2 = 0.1 we have ρdatadep = 0.6875 which is about 0.043 more than Chosen Path [PITH_FULL_IMAGE:figures/full_fig_p108_4_3.png] view at source ↗

**Figure 4.4.** Figure 4.4: The difference ρ − ρangular comparing Chosen Path and angular LSH in terms of BraunBlanquet similarities 0 < b2 < b1 < 1. space by O’Donnell et al. [121] which we use, through a reduction, to show LSH lower bounds under Braun-Blanquet similarity. Lower bounds for locality-sensitive maps. Because our upper bound is based on a locality-sensitive map MB and not LSH-based we first show that LSH lower bounds… view at source ↗

**Figure 4.5.** Figure 4.5: The difference ρ − ρdatadep comparing Chosen Path and data-dependent LSH in terms of Braun-Blanquet similarities 0 < b2 < b1 < 1. In the area of the parameter space that is colored blue we have that ρ ≤ ρdatadep while for the red area it holds that ρ > ρdatadep. Lemma 4.4. Suppose we have a (s1,s2, m1, m2)-sensitive family of maps M. Then we can construct a (s1,s2, p1, p2)-sensitive family of hash functi… view at source ↗

**Figure 4.6.** Figure 4.6: Solution with lowest ρ-value for the (j1, j2)-approximate Jaccard similarity search problem for different values of β. Blue is angular LSH. Green is MinHash. Red is Chosen Path. Note the difference in the axes for different values of β as it must hold that 0 ≤ j2 ≤ j1 ≤ β [PITH_FULL_IMAGE:figures/full_fig_p121_4_6.png] view at source ↗

**Figure 5.1.** Figure 5.1: Path termination depth in the Chosen Path Tree [PITH_FULL_IMAGE:figures/full_fig_p142_5_1.png] view at source ↗

**Figure 5.** Figure 5: shows the join time speedup that [PITH_FULL_IMAGE:figures/full_fig_p150_5.png] view at source ↗

**Figure 5.2.** Figure 5.2: Join time of CPSJoin with at least 90% recall relative to AllPairs. on the presence of rare tokens in the data to perform well. This difference is showcased with the synthetic TOKEN data sets. BayesLSH. The poor performance of BayesLSH compared to the other algorithms (BayesLSH was always slower) can most likely be tracked down to differences in the implementation of the candidate generation methods of B… view at source ↗

**Figure 5.** Figure 5: (a) shows the effect of the brute force limit on the join time. [PITH_FULL_IMAGE:figures/full_fig_p153_5.png] view at source ↗

**Figure 5.3.** Figure 5.3: Relative join time for CPSJoin with at least [PITH_FULL_IMAGE:figures/full_fig_p154_5_3.png] view at source ↗

**Figure 5.** Figure 5: (c) shows the effect of sketch length on the join time. There [PITH_FULL_IMAGE:figures/full_fig_p155_5.png] view at source ↗

read the original abstract

We study the problem of approximate near neighbor (ANN) search and show the following results: - An improved framework for solving the ANN problem using locality-sensitive hashing, reducing the number of evaluations of locality-sensitive hash functions and the word-RAM complexity compared to the standard framework. - A framework for solving the ANN problem with space-time tradeoffs as well as tight upper and lower bounds for the space-time tradeoff of framework solutions to the ANN problem under cosine similarity. - A novel approach to solving the ANN problem on sets along with a matching lower bound, improving the state of the art. - A self-tuning version of the algorithm is shown through experiments to outperform existing similarity join algorithms. - Tight lower bounds for asymmetric locality-sensitive hashing which has applications to the approximate furthest neighbor problem, orthogonal vector search, and annulus queries. - A proof of the optimality of a well-known Boolean locality-sensitive hashing scheme. We study the problem of efficient algorithms for producing high-quality pseudorandom numbers and obtain the following results: - A deterministic algorithm for generating pseudorandom numbers of arbitrarily high quality in constant time using near-optimal space. - A randomized construction of a family of hash functions that outputs pseudorandom numbers of arbitrarily high quality with space usage and running time nearly matching known cell-probe lower bounds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Collects several LSH improvements with matching bounds plus separate PRG results; experiments are the least developed part.

read the letter

The main new pieces are an improved LSH framework that cuts hash evaluations and word-RAM cost, space-time tradeoffs with tight upper and lower bounds under cosine similarity, and a set-based ANN method that comes with a matching lower bound. It also gives tight lower bounds for asymmetric LSH and proves that a standard Boolean LSH scheme is optimal. The PRG side adds a deterministic constant-time generator with near-optimal space and a randomized hash family that nearly matches cell-probe bounds. Experiments claim a self-tuning version beats existing similarity-join algorithms. The matching bounds and the optimality proof are the cleanest parts; they show the work is not just extending prior analyses but closing some gaps. The frameworks look like genuine reductions rather than redefinitions. The soft spots are the experiments, which are only summarized, and the fact that the PRG results sit apart from the ANN material. Without the proofs it is impossible to check the exact reductions or the model assumptions on the hash families, but the abstract shows no internal mismatch between upper and lower bounds. This is for researchers who care about concrete LSH tradeoffs and derandomization. A reader working on ANN data structures would get usable frameworks and bounds from it. The paper deserves a serious referee because the claims are specific and several of them come with matching lower bounds.

Referee Report

0 major / 2 minor

Summary. The paper studies approximate near neighbor (ANN) search and pseudorandomness. For ANN it claims an improved LSH framework reducing hash evaluations and word-RAM complexity versus the standard framework; a space-time tradeoff framework with tight upper/lower bounds under cosine similarity; a novel set-similarity approach with matching lower bound improving the state of the art; experimental evidence that a self-tuning variant outperforms existing similarity-join algorithms; tight lower bounds for asymmetric LSH (with applications to furthest-neighbor, orthogonal-vector, and annulus queries); and a proof that a well-known Boolean LSH scheme is optimal. For pseudorandomness it claims a deterministic constant-time generator of arbitrarily high quality using near-optimal space and a randomized hash-family construction whose space and time nearly match known cell-probe lower bounds.

Significance. If the stated reductions, tradeoffs, and matching bounds hold, the work would strengthen the theoretical foundations of LSH-based ANN and provide near-optimal pseudorandom constructions. The explicit matching of upper and lower bounds for the cosine-similarity space-time tradeoff and the set-similarity result are particularly valuable; the experimental self-tuning claim is independent of the theoretical contributions.

minor comments (2)

The abstract states that the self-tuning algorithm 'outperforms existing similarity join algorithms' via experiments, but the manuscript should clarify the precise baselines, data sets, and metrics used so that the experimental claim can be evaluated independently of the theoretical sections.
Notation for the improved LSH framework (number of hash evaluations, word-RAM cost) is introduced in the abstract without cross-reference to the section that defines the standard framework; adding an explicit pointer would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, detailed summary of our contributions, and recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; theoretical extensions are self-contained

full rationale

The paper presents algorithmic improvements to LSH-based ANN search, space-time tradeoffs with matching upper/lower bounds, a novel set-based approach, and pseudorandomness constructions. These are framed as new analyses, frameworks, and proofs extending standard LSH assumptions rather than redefining quantities in terms of themselves or calling fitted parameters predictions. No load-bearing self-citations or ansatzes are indicated in the abstract or claimed results; the work is self-contained against external benchmarks like prior LSH families and cell-probe lower bounds.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all such details reside in the full paper.

pith-pipeline@v0.9.0 · 5757 in / 1095 out tokens · 31053 ms · 2026-05-25T18:11:27.800787+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

176 extracted references · 176 canonical work pages · 7 internal anchors

[1]

ACM paris kanellakis theory and practice award

ACM. ACM paris kanellakis theory and practice award. https: //awards.acm.org/award_winners/charikar_0308379,

work page
[2]

[Online; accessed 26-April-2018]

work page 2018
[3]

Aggarwal, D

C. Aggarwal, D. A. Keim, and A. Hinneburg. On the surprising behaviour of distance metrics in high dimensional space. In Proc. ICDT ’01, pages 420–434, 2001

work page 2001
[4]

A. Agresti. Bounds on the extinction time distribution of a branch- ing process. Advances in Applied Probability, 6(2):322–335, 1974

work page 1974
[5]

T. D. Ahle, M. Aumüller, and R. Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In Proc. SODA ’17, pages 239–256, 2017

work page 2017
[6]

T. D. Ahle, R. Pagh, I. P . Razenshteyn, and F. Silvestri. On the complexity of inner product similarity join. In Proc. PODS’16 , pages 151–164, 2016

work page 2016
[7]

Alman and R

J. Alman and R. Williams. Probabilistic polynomials and hamming nearest neighbors. In Proc. FOCS ’15, pages 136–150, 2015

work page 2015
[8]

Alon and N

N. Alon and N. Asaf. k-wise independent random graphs. In Proc. FOCS ’08, pages 813–822, 2008

work page 2008
[9]

N. Alon, O. Goldreich, J. Håstad, and R. Peralta. Simple construc- tions of almost k-wise independent random variables. Random Structures & Algorithms, 3(3):289–304, 1992

work page 1992
[10]

A. Andoni. Nearest neighbor search: the old, the new, and the impossible. PhD thesis, MIT, 2009. 220 Bibliography

work page 2009
[11]

Andoni and P

A. Andoni and P . Indyk. Efﬁcient algorithms for substring near neighbor problem. In Proc. SODA ’06, pages 1203–1212, 2006

work page 2006
[12]

Andoni and P

A. Andoni and P . Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. FOCS ’06, pages 459–468, 2006

work page 2006
[13]

Andoni and P

A. Andoni and P . Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008

work page 2008
[14]

Andoni, P

A. Andoni, P . Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal lsh for angular distance. In Proc. NIPS ’15, pages 1225–1233, 2015

work page 2015
[15]

Andoni, P

A. Andoni, P . Indyk, H. L. Nguyen, and I. P . Razenshteyn. Beyond locality-sensitive hashing. In Proc. SODA ’14 , pages 1018–1028, 2014

work page 2014
[16]

Andoni, P

A. Andoni, P . Indyk, and M. Patrascu. On the optimality of the dimensionality reduction method. In Proc. FOCS ’06, pages 449–458, 2006

work page 2006
[17]

Andoni, T

A. Andoni, T. Laarhoven, I. P . Razenshteyn, and E. Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proc. SODA ’17, pages 47–66, 2017

work page 2017
[18]

Andoni and I

A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. STOC ’15, pages 793–801, 2015

work page 2015
[19]

Andoni, I

A. Andoni, I. P . Razenshteyn, and N. S. Nosatzki. LSH forest: Practical algorithms made theoretical. In Proc. SODA ’17, pages 67–78, 2017

work page 2017
[20]

Andoni and I

A. Andoni and I. Razensteyn. Tight lower bounds for data- dependent locality-sensitive hashing. In Proc. SoCG ’16 , pages 9:1–9:11, 2016

work page 2016
[21]

Arasu, V

A. Arasu, V . Ganti, and R. Kaushik. Efﬁcient exact set-similarity joins. In Proceedings of the 32nd international conference on Very large data bases, pages 918–929. VLDB Endowment, 2006. Bibliography 221

work page 2006
[22]

Augsten and M

N. Augsten and M. H. Böhlen. Similarity joins in relational database systems. Synthesis Lectures on Data Management, 5(5):1–124, 2013

work page 2013
[23]

Aumüller, T

M. Aumüller, T. Christiani, R. Pagh, and F. Silvestri. Distance- sensitive hashing. pages 89–104, 2018

work page 2018
[24]

R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. 16th World Wide Web Conference (WWW) , pages 131–140, 2007

work page 2007
[25]

Becker, L

A. Becker, L. Ducas, N. Gama, and T. Laarhoven. New directions in nearest neighbor searching with applications to lattice sieving. In Proc. SODA ’16, pages 10–24, 2016

work page 2016
[26]

Benyamini and J

Y. Benyamini and J. Lindenstrauss. Geometric nonlinear functional analysis, volume 48. American Mathematical Soc., Providence, Rhode Island, 1998

work page 1998
[27]

Bhattacharya, R

M. Bhattacharya, R. Creutzburg, and J. Astola. Some historical notes on the number theoretic transform. In Proc. 2004 Int. TICS Workshop on Spectral Methods and Multirate Signal Processing , 2004

work page 2004
[28]

Bostan and É

A. Bostan and É. Schost. Polynomial evaluation and interpolation on special sets of points. J. Complexity, 21(4):420–446, 2005

work page 2005
[29]

Braun-Blanquet

J. Braun-Blanquet. Plant sociology. The study of plant communities . McGraw-Hill, 1932

work page 1932
[30]

Braverman

M. Braverman. Polylogarithmic independence fools AC0 circuits. J. ACM, 57(5):28:1–28:10, 2010

work page 2010
[31]

A. Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings , pages 21–29. IEEE, 1997

work page 1997
[32]

A. Z. Broder. Identifying and ﬁltering near-duplicate documents. In Proc. Symp. on Combinatorial Pattern Matching (CPM) , pages 1–10. Springer, 2000

work page 2000
[33]

A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In Proc. STOC ’98, pages 327–336, 1998. 222 Bibliography

work page 1998
[34]

A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min- wise independent permutations. J. Comput. Syst. Sci., 60(3):630–659, 2000

work page 2000
[35]

A. Z. Broder, A. M. Frieze, and E. Upfal. Static and dynamic path selection on expander graphs: a random walk approach. Random Structures & Algorithms, 14(1):87–109, 1999

work page 1999
[36]

A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syn- tactic clustering of the web. Computer Networks, 29(8-13):1157–1166, 1997

work page 1997
[37]

A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syn- tactic clustering of the web. Computer Networks and ISDN Systems, 29(8):1157–1166, 1997

work page 1997
[38]

Burhman, P

H. Burhman, P . B. Miltersen, J. Radhakrishnan, and S. Venkatesh. Are bitvectors optimal? SIAM J. Comput., 31(6):1723–1744, 2002

work page 2002
[39]

M. Capalbo. Explicit bounded-degree unique-neighbor concentra- tors. Combinatorica, 25(4):379–391, 2005

work page 2005
[40]

Capalbo, O

M. Capalbo, O. Reingold, S. Vadhan, and A. Wigderson. Random- ness conductors and constant-degree lossless expanders. In Proc. STOC ’02, pages 659–668, 2002

work page 2002
[41]

J. L. Carter and M. N. Wegman. Universal classes of hash functions. In Proc. STOC ’77, pages 106–112, 1977

work page 1977
[42]

J. L. Carter and M. N. Wegman. Universal classes of hash functions. J. Comput. Syst. Sci., 18(2):143–154, 1979

work page 1979
[43]

J. L. Carter and M. N. Wegman. New hash functions and their use in authentication and set equality. J. Comput. System Sci., 22(3):265– 279, 1981

work page 1981
[44]

Chakrabarti, V

A. Chakrabarti, V . Satuluri, A. Srivathsan, and S. Parthasarathy. Locality sensitive hashing for similarity search and estima- tion. https://sites.google.com/site/lshallpairs/,

work page
[45]

Bibliography 223

[Accessed 15-May-2017]. Bibliography 223

work page 2017
[46]

Chakrabarti, V

A. Chakrabarti, V . Satuluri, A. Srivathsan, and S. Parthasarathy. A bayesian perspective on locality sensitive hashing with extensions for kernel methods. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(2):19, 2015

work page 2015
[47]

Chakraborty, E

D. Chakraborty, E. Goldenberg, and M. Kouck `y. Streaming al- gorithms for embedding and computing edit distance in the low distance regime. In Proc. STOC ’16, pages 712–725, 2016

work page 2016
[48]

J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for sim- ulating stable random variables. Jour. Am. Stat. Assoc., 71(354):340– 344, 1976

work page 1976
[49]

Charikar

M. Charikar. Similarity estimation techniques from rounding algo- rithms. In Proc. STOC ’02, pages 380–388, 2002

work page 2002
[50]

Chaudhuri, V

S. Chaudhuri, V . Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proc. 22nd Conference on Data Engineering (ICDE), page 5, 2006

work page 2006
[51]

Chierichetti and R

F. Chierichetti and R. Kumar. Lsh-preserving functions and their applications. J. ACM, 62(5):33, 2015

work page 2015
[52]

Chierichetti, R

F. Chierichetti, R. Kumar, A. Panconesi, and E. Terolli. The distor- tion of locality sensitive hashing. In Proc. ITCS ’17, pages 54:1–54:18, 2017

work page 2017
[53]

S. Choi, S. Cha, and C. C. Tappert. A survey of binary similarity and distance measures. J. Syst. Cybern. Informatics, 8(1):43–48, 2010

work page 2010
[54]

B. Chor, O. Goldreich, J. Hastad, J. Freidmann, S. Rudich, and R. Smolensky. The bit extration problem or t-resilient functions. In Proc. FOCS ’85, pages 396–407, 1985

work page 1985
[55]

Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search

T. Christiani. Fast locality-sensitive hashing for approximate near neighbor search. CoRR, abs/1708.07586, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[56]

Christiani

T. Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive ﬁltering. In Proc. SODA ’17, pages 31–46, 2017

work page 2017
[57]

Christiani and R

T. Christiani and R. Pagh. Generating k-independent variables in constant time. In Proc. FOCS ’14, pages 196–205, 2014. 224 Bibliography

work page 2014
[58]

Christiani and R

T. Christiani and R. Pagh. Set similarity search beyond minhash. In Proc. STOC ’17, pages 1094–1107, 2017

work page 2017
[59]

Scalable and robust set similarity join

T. Christiani, R. Pagh, and J. Sivertsen. Scalable and robust set similarity join. CoRR, abs/1707.06814, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[60]

Christiani, R

T. Christiani, R. Pagh, and M. Thorup. From independence to expansion and back again. In Proc. STOC ’15, pages 813–820, 2015

work page 2015
[61]

Chung, M

K. Chung, M. Mitzenmacher, and S. P . Vadhan. Why simple hash functions work: Exploiting the entropy in a data stream. Theory of Computing, 9:897–945, 2013

work page 2013
[62]

E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comp. Syst. Sci., 55(3):441–453, 1997

work page 1997
[63]

Cohen, M

E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P . Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations with- out support pruning. IEEE Transactions on Knowledge and Data Engineering, 13(1):64–78, 2001

work page 2001
[64]

Cohen and H

E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. ACM SIGMETRICS Perfor- mance Evaluation Review, 37(1):251–262, 2009

work page 2009
[65]

Cohen and D

J. Cohen and D. Kane. Bounds on the independence required for cuckoo hashing. http://web.mit.edu/jscohen/www/ cuckoo.pdf, 2009. [Online; accessed 29-April-2018]

work page 2009
[66]

Dahlgaard, M

S. Dahlgaard, M. B. T. Knudsen, and M. Thorup. Fast similarity sketching. In Proc. FOCS ’17, pages 663–671, 2017

work page 2017
[67]

Dasgupta, R

A. Dasgupta, R. Kumar, and T. Sarlós. Fast locality-sensitive hash- ing. In Proc. SIGKDD ’11, pages 1073–1081, 2011

work page 2011
[68]

Datar, N

M. Datar, N. Immorlica, P . Indyk, and V . S. Mirrokni. Locality- sensitive hashing scheme based on p-stable distributions. In Proc. SOCG ’04, pages 253–262, 2004

work page 2004
[69]

On the Complexity of Closest Pair via Polar-Pair of Point-Sets

R. David, Karthik C. S., and B. Laekhanukit. The curse of medium dimension for geometric problems in almost every norm. CoRR, abs/1608.03245, 2016. Bibliography 225

work page internal anchor Pith review Pith/arXiv arXiv 2016
[70]

de Berg, M

M. de Berg, M. van Kreveld, M. Overmars, and O. C. Schwarzkopf. Computational geometry. Springer, Berlin, third edition, 2008

work page 2008
[71]

Dietzfelbinger

M. Dietzfelbinger. On randomness in hash functions (invited talk). In Proc. STACS ’12, pages 25–28, 2012

work page 2012
[72]

Dietzfelbinger and F

M. Dietzfelbinger and F. Meyer auf der Heide. A new universal class of hash functions and dynamic hashing in real time. In Proc. ICALP ’90, pages 6–19, 1990

work page 1990
[73]

Dietzfelbinger and M

M. Dietzfelbinger and M. Rink. Applications of a splitting trick. In Proc. ICALP ’09, pages 354–365, 2009

work page 2009
[74]

Dietzfelbinger and P

M. Dietzfelbinger and P . Woelfel. Almost random graphs with simple hash functions. In Proc. STOC ’03, pages 629–638, 2003

work page 2003
[75]

M. Dubiner. Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem. IEEE Trans. Information Theory, 56(8):4166–4179, 2010

work page 2010
[76]

Duhamel and M

P . Duhamel and M. Vetterli. Fast Fourier transforms: a tutorial review and state of the art. Signal Processing, 19(4):259–299, 1990

work page 1990
[77]

Eshghi and S

K. Eshghi and S. Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proc. KDD ’08 , pages 221–229, 2008

work page 2008
[78]

M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with 0(1) worst case access time. J. ACM, 31(3):538–544, 1984

work page 1984
[79]

Gao and T

S. Gao and T. Mateer. Additive fast Fourier transforms over ﬁnite ﬁelds. IEEE Trans. Inf. Theory, 56(12):6265–6272, 2010

work page 2010
[80]

Gionis, P

A. Gionis, P . Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. VLDB ’99, pages 518–529, 1999

work page 1999

Showing first 80 references.

[1] [1]

ACM paris kanellakis theory and practice award

ACM. ACM paris kanellakis theory and practice award. https: //awards.acm.org/award_winners/charikar_0308379,

work page

[2] [2]

[Online; accessed 26-April-2018]

work page 2018

[3] [3]

Aggarwal, D

C. Aggarwal, D. A. Keim, and A. Hinneburg. On the surprising behaviour of distance metrics in high dimensional space. In Proc. ICDT ’01, pages 420–434, 2001

work page 2001

[4] [4]

A. Agresti. Bounds on the extinction time distribution of a branch- ing process. Advances in Applied Probability, 6(2):322–335, 1974

work page 1974

[5] [5]

T. D. Ahle, M. Aumüller, and R. Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In Proc. SODA ’17, pages 239–256, 2017

work page 2017

[6] [6]

T. D. Ahle, R. Pagh, I. P . Razenshteyn, and F. Silvestri. On the complexity of inner product similarity join. In Proc. PODS’16 , pages 151–164, 2016

work page 2016

[7] [7]

Alman and R

J. Alman and R. Williams. Probabilistic polynomials and hamming nearest neighbors. In Proc. FOCS ’15, pages 136–150, 2015

work page 2015

[8] [8]

Alon and N

N. Alon and N. Asaf. k-wise independent random graphs. In Proc. FOCS ’08, pages 813–822, 2008

work page 2008

[9] [9]

N. Alon, O. Goldreich, J. Håstad, and R. Peralta. Simple construc- tions of almost k-wise independent random variables. Random Structures & Algorithms, 3(3):289–304, 1992

work page 1992

[10] [10]

A. Andoni. Nearest neighbor search: the old, the new, and the impossible. PhD thesis, MIT, 2009. 220 Bibliography

work page 2009

[11] [11]

Andoni and P

A. Andoni and P . Indyk. Efﬁcient algorithms for substring near neighbor problem. In Proc. SODA ’06, pages 1203–1212, 2006

work page 2006

[12] [12]

Andoni and P

A. Andoni and P . Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. FOCS ’06, pages 459–468, 2006

work page 2006

[13] [13]

Andoni and P

A. Andoni and P . Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008

work page 2008

[14] [14]

Andoni, P

A. Andoni, P . Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal lsh for angular distance. In Proc. NIPS ’15, pages 1225–1233, 2015

work page 2015

[15] [15]

Andoni, P

A. Andoni, P . Indyk, H. L. Nguyen, and I. P . Razenshteyn. Beyond locality-sensitive hashing. In Proc. SODA ’14 , pages 1018–1028, 2014

work page 2014

[16] [16]

Andoni, P

A. Andoni, P . Indyk, and M. Patrascu. On the optimality of the dimensionality reduction method. In Proc. FOCS ’06, pages 449–458, 2006

work page 2006

[17] [17]

Andoni, T

A. Andoni, T. Laarhoven, I. P . Razenshteyn, and E. Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proc. SODA ’17, pages 47–66, 2017

work page 2017

[18] [18]

Andoni and I

A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. STOC ’15, pages 793–801, 2015

work page 2015

[19] [19]

Andoni, I

A. Andoni, I. P . Razenshteyn, and N. S. Nosatzki. LSH forest: Practical algorithms made theoretical. In Proc. SODA ’17, pages 67–78, 2017

work page 2017

[20] [20]

Andoni and I

A. Andoni and I. Razensteyn. Tight lower bounds for data- dependent locality-sensitive hashing. In Proc. SoCG ’16 , pages 9:1–9:11, 2016

work page 2016

[21] [21]

Arasu, V

A. Arasu, V . Ganti, and R. Kaushik. Efﬁcient exact set-similarity joins. In Proceedings of the 32nd international conference on Very large data bases, pages 918–929. VLDB Endowment, 2006. Bibliography 221

work page 2006

[22] [22]

Augsten and M

N. Augsten and M. H. Böhlen. Similarity joins in relational database systems. Synthesis Lectures on Data Management, 5(5):1–124, 2013

work page 2013

[23] [23]

Aumüller, T

M. Aumüller, T. Christiani, R. Pagh, and F. Silvestri. Distance- sensitive hashing. pages 89–104, 2018

work page 2018

[24] [24]

R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. 16th World Wide Web Conference (WWW) , pages 131–140, 2007

work page 2007

[25] [25]

Becker, L

A. Becker, L. Ducas, N. Gama, and T. Laarhoven. New directions in nearest neighbor searching with applications to lattice sieving. In Proc. SODA ’16, pages 10–24, 2016

work page 2016

[26] [26]

Benyamini and J

Y. Benyamini and J. Lindenstrauss. Geometric nonlinear functional analysis, volume 48. American Mathematical Soc., Providence, Rhode Island, 1998

work page 1998

[27] [27]

Bhattacharya, R

M. Bhattacharya, R. Creutzburg, and J. Astola. Some historical notes on the number theoretic transform. In Proc. 2004 Int. TICS Workshop on Spectral Methods and Multirate Signal Processing , 2004

work page 2004

[28] [28]

Bostan and É

A. Bostan and É. Schost. Polynomial evaluation and interpolation on special sets of points. J. Complexity, 21(4):420–446, 2005

work page 2005

[29] [29]

Braun-Blanquet

J. Braun-Blanquet. Plant sociology. The study of plant communities . McGraw-Hill, 1932

work page 1932

[30] [30]

Braverman

M. Braverman. Polylogarithmic independence fools AC0 circuits. J. ACM, 57(5):28:1–28:10, 2010

work page 2010

[31] [31]

A. Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings , pages 21–29. IEEE, 1997

work page 1997

[32] [32]

A. Z. Broder. Identifying and ﬁltering near-duplicate documents. In Proc. Symp. on Combinatorial Pattern Matching (CPM) , pages 1–10. Springer, 2000

work page 2000

[33] [33]

A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In Proc. STOC ’98, pages 327–336, 1998. 222 Bibliography

work page 1998

[34] [34]

A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min- wise independent permutations. J. Comput. Syst. Sci., 60(3):630–659, 2000

work page 2000

[35] [35]

A. Z. Broder, A. M. Frieze, and E. Upfal. Static and dynamic path selection on expander graphs: a random walk approach. Random Structures & Algorithms, 14(1):87–109, 1999

work page 1999

[36] [36]

A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syn- tactic clustering of the web. Computer Networks, 29(8-13):1157–1166, 1997

work page 1997

[37] [37]

A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syn- tactic clustering of the web. Computer Networks and ISDN Systems, 29(8):1157–1166, 1997

work page 1997

[38] [38]

Burhman, P

H. Burhman, P . B. Miltersen, J. Radhakrishnan, and S. Venkatesh. Are bitvectors optimal? SIAM J. Comput., 31(6):1723–1744, 2002

work page 2002

[39] [39]

M. Capalbo. Explicit bounded-degree unique-neighbor concentra- tors. Combinatorica, 25(4):379–391, 2005

work page 2005

[40] [40]

Capalbo, O

M. Capalbo, O. Reingold, S. Vadhan, and A. Wigderson. Random- ness conductors and constant-degree lossless expanders. In Proc. STOC ’02, pages 659–668, 2002

work page 2002

[41] [41]

J. L. Carter and M. N. Wegman. Universal classes of hash functions. In Proc. STOC ’77, pages 106–112, 1977

work page 1977

[42] [42]

J. L. Carter and M. N. Wegman. Universal classes of hash functions. J. Comput. Syst. Sci., 18(2):143–154, 1979

work page 1979

[43] [43]

J. L. Carter and M. N. Wegman. New hash functions and their use in authentication and set equality. J. Comput. System Sci., 22(3):265– 279, 1981

work page 1981

[44] [44]

Chakrabarti, V

A. Chakrabarti, V . Satuluri, A. Srivathsan, and S. Parthasarathy. Locality sensitive hashing for similarity search and estima- tion. https://sites.google.com/site/lshallpairs/,

work page

[45] [45]

Bibliography 223

[Accessed 15-May-2017]. Bibliography 223

work page 2017

[46] [46]

Chakrabarti, V

A. Chakrabarti, V . Satuluri, A. Srivathsan, and S. Parthasarathy. A bayesian perspective on locality sensitive hashing with extensions for kernel methods. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(2):19, 2015

work page 2015

[47] [47]

Chakraborty, E

D. Chakraborty, E. Goldenberg, and M. Kouck `y. Streaming al- gorithms for embedding and computing edit distance in the low distance regime. In Proc. STOC ’16, pages 712–725, 2016

work page 2016

[48] [48]

J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for sim- ulating stable random variables. Jour. Am. Stat. Assoc., 71(354):340– 344, 1976

work page 1976

[49] [49]

Charikar

M. Charikar. Similarity estimation techniques from rounding algo- rithms. In Proc. STOC ’02, pages 380–388, 2002

work page 2002

[50] [50]

Chaudhuri, V

S. Chaudhuri, V . Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proc. 22nd Conference on Data Engineering (ICDE), page 5, 2006

work page 2006

[51] [51]

Chierichetti and R

F. Chierichetti and R. Kumar. Lsh-preserving functions and their applications. J. ACM, 62(5):33, 2015

work page 2015

[52] [52]

Chierichetti, R

F. Chierichetti, R. Kumar, A. Panconesi, and E. Terolli. The distor- tion of locality sensitive hashing. In Proc. ITCS ’17, pages 54:1–54:18, 2017

work page 2017

[53] [53]

S. Choi, S. Cha, and C. C. Tappert. A survey of binary similarity and distance measures. J. Syst. Cybern. Informatics, 8(1):43–48, 2010

work page 2010

[54] [54]

B. Chor, O. Goldreich, J. Hastad, J. Freidmann, S. Rudich, and R. Smolensky. The bit extration problem or t-resilient functions. In Proc. FOCS ’85, pages 396–407, 1985

work page 1985

[55] [55]

Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search

T. Christiani. Fast locality-sensitive hashing for approximate near neighbor search. CoRR, abs/1708.07586, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[56] [56]

Christiani

T. Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive ﬁltering. In Proc. SODA ’17, pages 31–46, 2017

work page 2017

[57] [57]

Christiani and R

T. Christiani and R. Pagh. Generating k-independent variables in constant time. In Proc. FOCS ’14, pages 196–205, 2014. 224 Bibliography

work page 2014

[58] [58]

Christiani and R

T. Christiani and R. Pagh. Set similarity search beyond minhash. In Proc. STOC ’17, pages 1094–1107, 2017

work page 2017

[59] [59]

Scalable and robust set similarity join

T. Christiani, R. Pagh, and J. Sivertsen. Scalable and robust set similarity join. CoRR, abs/1707.06814, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[60] [60]

Christiani, R

T. Christiani, R. Pagh, and M. Thorup. From independence to expansion and back again. In Proc. STOC ’15, pages 813–820, 2015

work page 2015

[61] [61]

Chung, M

K. Chung, M. Mitzenmacher, and S. P . Vadhan. Why simple hash functions work: Exploiting the entropy in a data stream. Theory of Computing, 9:897–945, 2013

work page 2013

[62] [62]

E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comp. Syst. Sci., 55(3):441–453, 1997

work page 1997

[63] [63]

Cohen, M

E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P . Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations with- out support pruning. IEEE Transactions on Knowledge and Data Engineering, 13(1):64–78, 2001

work page 2001

[64] [64]

Cohen and H

E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. ACM SIGMETRICS Perfor- mance Evaluation Review, 37(1):251–262, 2009

work page 2009

[65] [65]

Cohen and D

J. Cohen and D. Kane. Bounds on the independence required for cuckoo hashing. http://web.mit.edu/jscohen/www/ cuckoo.pdf, 2009. [Online; accessed 29-April-2018]

work page 2009

[66] [66]

Dahlgaard, M

S. Dahlgaard, M. B. T. Knudsen, and M. Thorup. Fast similarity sketching. In Proc. FOCS ’17, pages 663–671, 2017

work page 2017

[67] [67]

Dasgupta, R

A. Dasgupta, R. Kumar, and T. Sarlós. Fast locality-sensitive hash- ing. In Proc. SIGKDD ’11, pages 1073–1081, 2011

work page 2011

[68] [68]

Datar, N

M. Datar, N. Immorlica, P . Indyk, and V . S. Mirrokni. Locality- sensitive hashing scheme based on p-stable distributions. In Proc. SOCG ’04, pages 253–262, 2004

work page 2004

[69] [69]

On the Complexity of Closest Pair via Polar-Pair of Point-Sets

R. David, Karthik C. S., and B. Laekhanukit. The curse of medium dimension for geometric problems in almost every norm. CoRR, abs/1608.03245, 2016. Bibliography 225

work page internal anchor Pith review Pith/arXiv arXiv 2016

[70] [70]

de Berg, M

M. de Berg, M. van Kreveld, M. Overmars, and O. C. Schwarzkopf. Computational geometry. Springer, Berlin, third edition, 2008

work page 2008

[71] [71]

Dietzfelbinger

M. Dietzfelbinger. On randomness in hash functions (invited talk). In Proc. STACS ’12, pages 25–28, 2012

work page 2012

[72] [72]

Dietzfelbinger and F

M. Dietzfelbinger and F. Meyer auf der Heide. A new universal class of hash functions and dynamic hashing in real time. In Proc. ICALP ’90, pages 6–19, 1990

work page 1990

[73] [73]

Dietzfelbinger and M

M. Dietzfelbinger and M. Rink. Applications of a splitting trick. In Proc. ICALP ’09, pages 354–365, 2009

work page 2009

[74] [74]

Dietzfelbinger and P

M. Dietzfelbinger and P . Woelfel. Almost random graphs with simple hash functions. In Proc. STOC ’03, pages 629–638, 2003

work page 2003

[75] [75]

M. Dubiner. Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem. IEEE Trans. Information Theory, 56(8):4166–4179, 2010

work page 2010

[76] [76]

Duhamel and M

P . Duhamel and M. Vetterli. Fast Fourier transforms: a tutorial review and state of the art. Signal Processing, 19(4):259–299, 1990

work page 1990

[77] [77]

Eshghi and S

K. Eshghi and S. Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proc. KDD ’08 , pages 221–229, 2008

work page 2008

[78] [78]

M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with 0(1) worst case access time. J. ACM, 31(3):538–544, 1984

work page 1984

[79] [79]

Gao and T

S. Gao and T. Mateer. Additive fast Fourier transforms over ﬁnite ﬁelds. IEEE Trans. Inf. Theory, 56(12):6265–6272, 2010

work page 2010

[80] [80]

Gionis, P

A. Gionis, P . Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. VLDB ’99, pages 518–529, 1999

work page 1999