Algorithms for Similarity Search and Pseudorandomness
Pith reviewed 2026-05-25 18:11 UTC · model grok-4.3
The pith
An improved locality-sensitive hashing framework solves approximate near neighbor search with fewer hash evaluations and lower word-RAM complexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The author claims that an improved locality-sensitive hashing framework reduces the number of evaluations of locality-sensitive hash functions and the word-RAM complexity for solving the approximate near neighbor problem, provides space-time tradeoffs with tight upper and lower bounds for cosine similarity, offers a novel approach for the problem on sets with a matching lower bound, shows optimality of a Boolean scheme, gives tight lower bounds for asymmetric hashing, and supplies deterministic and randomized methods for high-quality pseudorandom number generation with near-optimal space and time.
What carries the argument
The improved framework for the approximate near neighbor problem that uses locality-sensitive hashing to reduce hash evaluations and word-RAM complexity.
If this is right
- The number of locality-sensitive hash function evaluations drops compared to the standard framework.
- Word-RAM complexity improves for approximate near neighbor solutions.
- Tight upper and lower bounds characterize space-time tradeoffs under cosine similarity.
- A novel approach improves performance for search on sets while matching a lower bound.
- Deterministic algorithms generate high-quality pseudorandom numbers in constant time with near-optimal space.
Where Pith is reading between the lines
- Practical retrieval systems could adopt the self-tuning variant to improve performance without manual tuning of parameters.
- The matching lower bounds suggest that further asymptotic gains for set-based search are unlikely under the stated model.
- The pseudorandom constructions could strengthen randomized algorithms that need high-quality bits while staying within tight space limits.
Load-bearing premise
Locality-sensitive hash families exist that preserve the required similarities with the probabilities stated in the standard definitions.
What would settle it
A data distribution or similarity measure where the new framework requires the same number of hash evaluations as the standard framework would show that the claimed reduction does not hold.
Figures
read the original abstract
We study the problem of approximate near neighbor (ANN) search and show the following results: - An improved framework for solving the ANN problem using locality-sensitive hashing, reducing the number of evaluations of locality-sensitive hash functions and the word-RAM complexity compared to the standard framework. - A framework for solving the ANN problem with space-time tradeoffs as well as tight upper and lower bounds for the space-time tradeoff of framework solutions to the ANN problem under cosine similarity. - A novel approach to solving the ANN problem on sets along with a matching lower bound, improving the state of the art. - A self-tuning version of the algorithm is shown through experiments to outperform existing similarity join algorithms. - Tight lower bounds for asymmetric locality-sensitive hashing which has applications to the approximate furthest neighbor problem, orthogonal vector search, and annulus queries. - A proof of the optimality of a well-known Boolean locality-sensitive hashing scheme. We study the problem of efficient algorithms for producing high-quality pseudorandom numbers and obtain the following results: - A deterministic algorithm for generating pseudorandom numbers of arbitrarily high quality in constant time using near-optimal space. - A randomized construction of a family of hash functions that outputs pseudorandom numbers of arbitrarily high quality with space usage and running time nearly matching known cell-probe lower bounds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies approximate near neighbor (ANN) search and pseudorandomness. For ANN it claims an improved LSH framework reducing hash evaluations and word-RAM complexity versus the standard framework; a space-time tradeoff framework with tight upper/lower bounds under cosine similarity; a novel set-similarity approach with matching lower bound improving the state of the art; experimental evidence that a self-tuning variant outperforms existing similarity-join algorithms; tight lower bounds for asymmetric LSH (with applications to furthest-neighbor, orthogonal-vector, and annulus queries); and a proof that a well-known Boolean LSH scheme is optimal. For pseudorandomness it claims a deterministic constant-time generator of arbitrarily high quality using near-optimal space and a randomized hash-family construction whose space and time nearly match known cell-probe lower bounds.
Significance. If the stated reductions, tradeoffs, and matching bounds hold, the work would strengthen the theoretical foundations of LSH-based ANN and provide near-optimal pseudorandom constructions. The explicit matching of upper and lower bounds for the cosine-similarity space-time tradeoff and the set-similarity result are particularly valuable; the experimental self-tuning claim is independent of the theoretical contributions.
minor comments (2)
- The abstract states that the self-tuning algorithm 'outperforms existing similarity join algorithms' via experiments, but the manuscript should clarify the precise baselines, data sets, and metrics used so that the experimental claim can be evaluated independently of the theoretical sections.
- Notation for the improved LSH framework (number of hash evaluations, word-RAM cost) is introduced in the abstract without cross-reference to the section that defines the standard framework; adding an explicit pointer would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, detailed summary of our contributions, and recommendation of minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity; theoretical extensions are self-contained
full rationale
The paper presents algorithmic improvements to LSH-based ANN search, space-time tradeoffs with matching upper/lower bounds, a novel set-based approach, and pseudorandomness constructions. These are framed as new analyses, frameworks, and proofs extending standard LSH assumptions rather than redefining quantities in terms of themselves or calling fitted parameters predictions. No load-bearing self-citations or ansatzes are indicated in the abstract or claimed results; the work is self-contained against external benchmarks like prior LSH families and cell-probe lower bounds.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
ACM paris kanellakis theory and practice award
ACM. ACM paris kanellakis theory and practice award. https: //awards.acm.org/award_winners/charikar_0308379,
-
[2]
[Online; accessed 26-April-2018]
work page 2018
-
[3]
C. Aggarwal, D. A. Keim, and A. Hinneburg. On the surprising behaviour of distance metrics in high dimensional space. In Proc. ICDT ’01, pages 420–434, 2001
work page 2001
-
[4]
A. Agresti. Bounds on the extinction time distribution of a branch- ing process. Advances in Applied Probability, 6(2):322–335, 1974
work page 1974
-
[5]
T. D. Ahle, M. Aumüller, and R. Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In Proc. SODA ’17, pages 239–256, 2017
work page 2017
-
[6]
T. D. Ahle, R. Pagh, I. P . Razenshteyn, and F. Silvestri. On the complexity of inner product similarity join. In Proc. PODS’16 , pages 151–164, 2016
work page 2016
-
[7]
J. Alman and R. Williams. Probabilistic polynomials and hamming nearest neighbors. In Proc. FOCS ’15, pages 136–150, 2015
work page 2015
-
[8]
N. Alon and N. Asaf. k-wise independent random graphs. In Proc. FOCS ’08, pages 813–822, 2008
work page 2008
-
[9]
N. Alon, O. Goldreich, J. Håstad, and R. Peralta. Simple construc- tions of almost k-wise independent random variables. Random Structures & Algorithms, 3(3):289–304, 1992
work page 1992
-
[10]
A. Andoni. Nearest neighbor search: the old, the new, and the impossible. PhD thesis, MIT, 2009. 220 Bibliography
work page 2009
-
[11]
A. Andoni and P . Indyk. Efficient algorithms for substring near neighbor problem. In Proc. SODA ’06, pages 1203–1212, 2006
work page 2006
-
[12]
A. Andoni and P . Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. FOCS ’06, pages 459–468, 2006
work page 2006
-
[13]
A. Andoni and P . Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008
work page 2008
- [14]
- [15]
- [16]
- [17]
-
[18]
A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. STOC ’15, pages 793–801, 2015
work page 2015
- [19]
-
[20]
A. Andoni and I. Razensteyn. Tight lower bounds for data- dependent locality-sensitive hashing. In Proc. SoCG ’16 , pages 9:1–9:11, 2016
work page 2016
- [21]
-
[22]
N. Augsten and M. H. Böhlen. Similarity joins in relational database systems. Synthesis Lectures on Data Management, 5(5):1–124, 2013
work page 2013
-
[23]
M. Aumüller, T. Christiani, R. Pagh, and F. Silvestri. Distance- sensitive hashing. pages 89–104, 2018
work page 2018
-
[24]
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. 16th World Wide Web Conference (WWW) , pages 131–140, 2007
work page 2007
- [25]
-
[26]
Y. Benyamini and J. Lindenstrauss. Geometric nonlinear functional analysis, volume 48. American Mathematical Soc., Providence, Rhode Island, 1998
work page 1998
-
[27]
M. Bhattacharya, R. Creutzburg, and J. Astola. Some historical notes on the number theoretic transform. In Proc. 2004 Int. TICS Workshop on Spectral Methods and Multirate Signal Processing , 2004
work page 2004
-
[28]
A. Bostan and É. Schost. Polynomial evaluation and interpolation on special sets of points. J. Complexity, 21(4):420–446, 2005
work page 2005
-
[29]
J. Braun-Blanquet. Plant sociology. The study of plant communities . McGraw-Hill, 1932
work page 1932
- [30]
-
[31]
A. Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings , pages 21–29. IEEE, 1997
work page 1997
-
[32]
A. Z. Broder. Identifying and filtering near-duplicate documents. In Proc. Symp. on Combinatorial Pattern Matching (CPM) , pages 1–10. Springer, 2000
work page 2000
-
[33]
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In Proc. STOC ’98, pages 327–336, 1998. 222 Bibliography
work page 1998
-
[34]
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min- wise independent permutations. J. Comput. Syst. Sci., 60(3):630–659, 2000
work page 2000
-
[35]
A. Z. Broder, A. M. Frieze, and E. Upfal. Static and dynamic path selection on expander graphs: a random walk approach. Random Structures & Algorithms, 14(1):87–109, 1999
work page 1999
-
[36]
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syn- tactic clustering of the web. Computer Networks, 29(8-13):1157–1166, 1997
work page 1997
-
[37]
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syn- tactic clustering of the web. Computer Networks and ISDN Systems, 29(8):1157–1166, 1997
work page 1997
-
[38]
H. Burhman, P . B. Miltersen, J. Radhakrishnan, and S. Venkatesh. Are bitvectors optimal? SIAM J. Comput., 31(6):1723–1744, 2002
work page 2002
-
[39]
M. Capalbo. Explicit bounded-degree unique-neighbor concentra- tors. Combinatorica, 25(4):379–391, 2005
work page 2005
-
[40]
M. Capalbo, O. Reingold, S. Vadhan, and A. Wigderson. Random- ness conductors and constant-degree lossless expanders. In Proc. STOC ’02, pages 659–668, 2002
work page 2002
-
[41]
J. L. Carter and M. N. Wegman. Universal classes of hash functions. In Proc. STOC ’77, pages 106–112, 1977
work page 1977
-
[42]
J. L. Carter and M. N. Wegman. Universal classes of hash functions. J. Comput. Syst. Sci., 18(2):143–154, 1979
work page 1979
-
[43]
J. L. Carter and M. N. Wegman. New hash functions and their use in authentication and set equality. J. Comput. System Sci., 22(3):265– 279, 1981
work page 1981
-
[44]
A. Chakrabarti, V . Satuluri, A. Srivathsan, and S. Parthasarathy. Locality sensitive hashing for similarity search and estima- tion. https://sites.google.com/site/lshallpairs/,
- [45]
-
[46]
A. Chakrabarti, V . Satuluri, A. Srivathsan, and S. Parthasarathy. A bayesian perspective on locality sensitive hashing with extensions for kernel methods. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(2):19, 2015
work page 2015
-
[47]
D. Chakraborty, E. Goldenberg, and M. Kouck `y. Streaming al- gorithms for embedding and computing edit distance in the low distance regime. In Proc. STOC ’16, pages 712–725, 2016
work page 2016
-
[48]
J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for sim- ulating stable random variables. Jour. Am. Stat. Assoc., 71(354):340– 344, 1976
work page 1976
- [49]
-
[50]
S. Chaudhuri, V . Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proc. 22nd Conference on Data Engineering (ICDE), page 5, 2006
work page 2006
-
[51]
F. Chierichetti and R. Kumar. Lsh-preserving functions and their applications. J. ACM, 62(5):33, 2015
work page 2015
-
[52]
F. Chierichetti, R. Kumar, A. Panconesi, and E. Terolli. The distor- tion of locality sensitive hashing. In Proc. ITCS ’17, pages 54:1–54:18, 2017
work page 2017
-
[53]
S. Choi, S. Cha, and C. C. Tappert. A survey of binary similarity and distance measures. J. Syst. Cybern. Informatics, 8(1):43–48, 2010
work page 2010
-
[54]
B. Chor, O. Goldreich, J. Hastad, J. Freidmann, S. Rudich, and R. Smolensky. The bit extration problem or t-resilient functions. In Proc. FOCS ’85, pages 396–407, 1985
work page 1985
-
[55]
Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search
T. Christiani. Fast locality-sensitive hashing for approximate near neighbor search. CoRR, abs/1708.07586, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[56]
T. Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive filtering. In Proc. SODA ’17, pages 31–46, 2017
work page 2017
-
[57]
T. Christiani and R. Pagh. Generating k-independent variables in constant time. In Proc. FOCS ’14, pages 196–205, 2014. 224 Bibliography
work page 2014
-
[58]
T. Christiani and R. Pagh. Set similarity search beyond minhash. In Proc. STOC ’17, pages 1094–1107, 2017
work page 2017
-
[59]
Scalable and robust set similarity join
T. Christiani, R. Pagh, and J. Sivertsen. Scalable and robust set similarity join. CoRR, abs/1707.06814, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[60]
T. Christiani, R. Pagh, and M. Thorup. From independence to expansion and back again. In Proc. STOC ’15, pages 813–820, 2015
work page 2015
- [61]
-
[62]
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comp. Syst. Sci., 55(3):441–453, 1997
work page 1997
- [63]
-
[64]
E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. ACM SIGMETRICS Perfor- mance Evaluation Review, 37(1):251–262, 2009
work page 2009
-
[65]
J. Cohen and D. Kane. Bounds on the independence required for cuckoo hashing. http://web.mit.edu/jscohen/www/ cuckoo.pdf, 2009. [Online; accessed 29-April-2018]
work page 2009
-
[66]
S. Dahlgaard, M. B. T. Knudsen, and M. Thorup. Fast similarity sketching. In Proc. FOCS ’17, pages 663–671, 2017
work page 2017
-
[67]
A. Dasgupta, R. Kumar, and T. Sarlós. Fast locality-sensitive hash- ing. In Proc. SIGKDD ’11, pages 1073–1081, 2011
work page 2011
- [68]
-
[69]
On the Complexity of Closest Pair via Polar-Pair of Point-Sets
R. David, Karthik C. S., and B. Laekhanukit. The curse of medium dimension for geometric problems in almost every norm. CoRR, abs/1608.03245, 2016. Bibliography 225
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[70]
M. de Berg, M. van Kreveld, M. Overmars, and O. C. Schwarzkopf. Computational geometry. Springer, Berlin, third edition, 2008
work page 2008
-
[71]
M. Dietzfelbinger. On randomness in hash functions (invited talk). In Proc. STACS ’12, pages 25–28, 2012
work page 2012
-
[72]
M. Dietzfelbinger and F. Meyer auf der Heide. A new universal class of hash functions and dynamic hashing in real time. In Proc. ICALP ’90, pages 6–19, 1990
work page 1990
-
[73]
M. Dietzfelbinger and M. Rink. Applications of a splitting trick. In Proc. ICALP ’09, pages 354–365, 2009
work page 2009
-
[74]
M. Dietzfelbinger and P . Woelfel. Almost random graphs with simple hash functions. In Proc. STOC ’03, pages 629–638, 2003
work page 2003
-
[75]
M. Dubiner. Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem. IEEE Trans. Information Theory, 56(8):4166–4179, 2010
work page 2010
-
[76]
P . Duhamel and M. Vetterli. Fast Fourier transforms: a tutorial review and state of the art. Signal Processing, 19(4):259–299, 1990
work page 1990
-
[77]
K. Eshghi and S. Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proc. KDD ’08 , pages 221–229, 2008
work page 2008
-
[78]
M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with 0(1) worst case access time. J. ACM, 31(3):538–544, 1984
work page 1984
- [79]
- [80]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.