pith. sign in

arxiv: 1907.11803 · v1 · pith:3CPZH3OPnew · submitted 2019-07-26 · 💻 cs.DB

qwLSH: Cache-conscious Indexing for Processing Similarity Search Query Workloads in High-Dimensional Spaces

Pith reviewed 2026-05-24 14:50 UTC · model grok-4.3

classification 💻 cs.DB
keywords Locality Sensitive Hashingcache-conscious indexingsimilarity searchquery workloadshigh-dimensional spacesapproximate nearest neighborindex structures
0
0 comments X

The pith

qwLSH uses novel cost models to divide cache space for faster processing of similarity search query workloads in high-dimensional spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces qwLSH as an index structure that processes groups of similarity search queries more efficiently than standard LSH methods. Existing LSH approaches are built for isolated queries and ignore cache behavior across an entire workload. qwLSH applies new cost models that allocate cache resources according to data characteristics and workload patterns. The result is reduced cache misses and higher throughput when queries arrive together in domains such as image processing and machine learning. The central argument is that accounting for cache utilization during index construction directly improves performance on real-world query batches.

Core claim

qwLSH is an index structure for efficiently processing similarity search query workloads in high-dimensional spaces. It intelligently divides a given cache during processing of a query workload by using novel cost models. Experimental results show that, given a query workload, qwLSH is able to perform faster than existing techniques due to its unique cost models and strategies.

What carries the argument

Novel cost models that divide cache based on data characteristics and workload patterns to improve utilization during approximate similarity search.

If this is right

  • qwLSH executes similarity search workloads faster than prior LSH variants.
  • The index accounts for cache utilization explicitly in its design for high-dimensional data.
  • It mitigates the single-query focus limitation of standard LSH when queries arrive in batches.
  • The approach remains applicable in settings where exact indexing fails due to the curse of dimensionality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The cache-division strategy could be adapted to other approximate indexing families beyond LSH.
  • Workload-aware cost models might inform query scheduling policies inside database engines.
  • Further tests on varying dimensionality and cache sizes could identify the range where the models remain effective.

Load-bearing premise

The novel cost models accurately capture and predict cache utilization behavior for the given query workloads and data characteristics.

What would settle it

An experiment in which the cost models are applied to a workload but measured cache hit rates show no improvement or a drop, resulting in no speedup or slower execution than baseline LSH, would falsify the performance claim.

Figures

Figures reproduced from arXiv: 1907.11803 by John Ossorgin, Omid Jafari, Parth Nagarkar.

Figure 1
Figure 1. Figure 1: Effect of Cardinality over the ratio IndexIO/TotalIO. We create versions of the Deepsat dataset (see Section 6 for more details) with different cardinalities but same dim. [=1500] for 250 top-50 queries using QALSH alg. [11]. sacrifice accuracy for a much faster performance. Formally, the goal of the approximate version of the similarity search problem, also called c-approximate Nearest Neighbor search, is… view at source ↗
Figure 2
Figure 2. Figure 2: Effect of Dimensionality over DataIO/TotalIO over the Deepsat dataset with different dimensionalities but same card. [=50K] for 250 top-50 queries using QALSH alg. [11]. 1.2 Motivation Often times, in real-world settings, queries arrive as part of a query workload. Several research works have shown the benefits of designing index structures particularly for effi￾cient handling of query workloads [3, 8, 17,… view at source ↗
Figure 3
Figure 3. Figure 3: Different Strategies for designing the cache 25K 50K 100K 250K 0 20 40 60 80 100 50 250 500 1500 3136 Optimal Setting for Index Cache Cardinality Size (in % of Total Cache Size) Dimensionality Optimal Setting for Index Cache Size [DeepSat dataset] 25K 50K 100K 250K [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Optimal Setting for Index Cache Size (in % of the Total Available Cache Size) on different versions of the Deepsat dataset [16MB Cache Size, |Q|=250] 0 200 400 600 800 0 20 40 60 80 100 TotalIO (in MB) Index Cache Size Setting (in %) Effect of Index Cache Size Setting on IO [P53 Dataset] IndexIO DataIO TotalIO [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of Index Cache Size Setting on the In￾dexIO, DataIO, and TotalIO for P53 dataset [16MB Cache Size, |Q|=250] on the size of CD . Thus, given a query workload Q, we want to find size(CI) (or size(CD )) such that cost(C) is minimized: minimize Õ |Q| д=1  αcar d × Õm i=1 (cost(C д Ii )) + αd im × Õw j=1 (cost(C д Dj )) subject to size(CI) = size(C) − size(CD ). 5 qwLSH In this section, we describe our… view at source ↗
Figure 6
Figure 6. Figure 6 [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Effect of varying top-k on P53 dataset [16MB Cache, |Q|=250] 6 Experimental Evaluation In this section, we evaluate the effectiveness of our proposed index structure, qwLSH. For these evaluations, we use several real data sets with different cardinality and dimensionality, under different system parameters. All experiments were run on machines with the following specifications: Intel Core i7-6700, 16GB RAM… view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of qwLSH (TotalIO) against its alternatives (for different real datasets) and default settings 1,517.74 609.30 1,063.23 609.30 689.98 0 400 800 1200 1600 QA_Naive QA_Ci QA_CiCd QA_Opt qwLSH TotalIO (in MB) Audio [8MB Cache, |Q|=250, k=50] 1,041.33 433.39 501.04 396.10 431.93 0 300 600 900 1200 QA_Naive QA_Ci QA_CiCd QA_Opt qwLSH TotalIO (in MB) P53 [8MB Cache, |Q|=250, k=50] 1,517.74 38.44 357.8… view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of qwLSH (TotalIO) for varying cache size against its alternatives (for Audio and P53 datasets) 319.21 17.53 114.95 14.74 17.09 0 90 180 270 360 QA_Naive QA_Ci QA_CiCd QA_Opt qwLSH TotalIO (in MB) Audio [16MB Cache, |Q|=50, k=50] 226.36 90.77 15.78 15.78 15.78 0 70 140 210 280 QA_Naive QA_Ci QA_CiCd QA_Opt qwLSH TotalIO (in MB) P53 [16MB Cache, |Q|=50, k=50] 626.94 23.14 234.43 17.94 24.29 0 175… view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of qwLSH (TotalIO) for varying number of queries against its alternatives (for Audio and P53 datasets) • QALSH_Ci: In this alternative, we allocate the maximum size of 99% (of the total cache size) to CI . Hence, CD will have 1% of the total cache size allocation. • QALSH_Cd: Here, we allocate the maximum size of 99% (of the total cache size) to CD . Hence, CI will have 1% of the total cache si… view at source ↗
Figure 11
Figure 11. Figure 11: Effect of caching overhead on Time (in ms) qwLSH can always adapt and return results closer to the optimal setting. Overhead of qwLSH’s Cache Implementation [PITH_FULL_IMAGE:figures/full_fig_p008_11.png] view at source ↗
read the original abstract

Similarity search queries in high-dimensional spaces are an important type of queries in many domains such as image processing, machine learning, etc. Since exact similarity search indexing techniques suffer from the well-known curse of dimensionality in high-dimensional spaces, approximate search techniques are often utilized instead. Locality Sensitive Hashing (LSH) has been shown to be an effective approximate search method for solving similarity search queries in high-dimensional spaces. Often times, queries in real-world settings arrive as part of a query workload. LSH and its variants are particularly designed to solve single queries effectively. They suffer from one major drawback while executing query workloads: they do not take into consideration important data characteristics for effective cache utilization while designing the index structures. In this paper, we present qwLSH, an index structure for efficiently processing similarity search query workloads in high-dimensional spaces. We intelligently divide a given cache during processing of a query workload by using novel cost models. Experimental results show that, given a query workload, qwLSH is able to perform faster than existing techniques due to its unique cost models and strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents qwLSH, an LSH-based index structure for similarity search query workloads in high-dimensional spaces. It introduces novel cost models to intelligently divide cache during workload processing for improved utilization, claiming that experimental results demonstrate faster performance than existing techniques due to these models and strategies.

Significance. If the cost models are shown to accurately predict cache behavior and yield reproducible gains, the work could advance cache-conscious approximate indexing for workload-driven similarity search in domains such as image processing and machine learning, addressing a limitation of standard LSH methods that focus on single queries.

major comments (1)
  1. [Abstract] Abstract: the claim that 'experimental results show that, given a query workload, qwLSH is able to perform faster than existing techniques due to its unique cost models and strategies' supplies no quantitative data, baselines, error bars, workload characteristics, or experimental setup, preventing verification of whether the cost models accurately capture cache utilization (the weakest assumption underlying the central claim).

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their feedback on the abstract. We address the major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'experimental results show that, given a query workload, qwLSH is able to perform faster than existing techniques due to its unique cost models and strategies' supplies no quantitative data, baselines, error bars, workload characteristics, or experimental setup, preventing verification of whether the cost models accurately capture cache utilization (the weakest assumption underlying the central claim).

    Authors: We agree that the abstract is a high-level summary and does not include quantitative experimental details such as specific speedups, baselines, workload characteristics, or setup information. This limits the ability to immediately verify the claims about cache utilization. In the revised manuscript, we will update the abstract to concisely include key quantitative results from the experiments (e.g., observed performance improvements and workload types) while maintaining its brevity. The body of the paper already contains the full experimental evaluation, including direct measurements of cache behavior and comparisons that support the cost models' effectiveness. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces qwLSH with novel cost models for cache-conscious indexing of query workloads in high-dimensional spaces using LSH variants. The central claim rests on these models enabling better performance, validated empirically in experiments. No equations, definitions, or derivations in the provided abstract or description reduce a prediction to a fitted input by construction, invoke self-citations as load-bearing uniqueness theorems, or rename known results. The cost models are presented as new contributions whose accuracy is tested externally rather than assumed tautologically. This is a standard empirical systems paper with independent content in its strategies and evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5726 in / 1045 out tokens · 20045 ms · 2026-05-24T14:50:45.577079+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1]

    Sort-based query-adaptive loading of r-trees

    Daniar Achakeev, Bernhard Seeger, and Peter Widmayer. Sort-based query-adaptive loading of r-trees. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management , CIKM ’12, pages 2080–2084, New York, NY, USA, 2012. ACM

  2. [2]

    Aly, Hazem Elmeleegy, Yan Qi, and Walid Aref

    Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, and Walid Aref. Kangaroo: Workload-aware processing of range data and range queries in hadoop. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM ’16, pages 397–406, New York, NY, USA, 2016. ACM

  3. [3]

    Aly, Ahmed R

    Ahmed M. Aly, Ahmed R. Mahmood, Mohamed S. Hassan, Walid G. Aref, Mourad Ouzzani, Hazem Elmeleegy, and Thamir Qadah. Aqwa: Adaptive query workload aware partitioning of big spatial data. Proc. VLDB Endow., 8(13):2062–2073, September 2015

  4. [4]

    Lsh forest: Self-tuning indexes for similarity search

    Mayank Bawa, Tyson Condie, and Prasanna Ganesan. Lsh forest: Self-tuning indexes for similarity search. In Proceedings of the 14th International Conference on World Wide Web, WWW ’05, pages 651– 660, New York, NY, USA, 2005. ACM

  5. [5]

    Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

    Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James P Drake, Jane M Landolin, and Adam M Phillippy. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Na- ture Biotechnology, 33(6):623–630, jun 2015

  6. [6]

    Locality Sensitive Hashing for satellite images using texture feature vectors

    Ruben Buaba, Mohamed Gebril, Abdollah Homaifar, Eric Kihn, and Mikhail Zhizhin. Locality Sensitive Hashing for satellite images using texture feature vectors. In 2010 IEEE Aerospace Conference, pages 1–10. IEEE, mar 2010

  7. [7]

    Efficient large-scale sequence comparison by locality- sensitive hashing

    Jeremy Buhler. Efficient large-scale sequence comparison by locality- sensitive hashing. Bioinformatics, 17(5):419–428, may 2001

  8. [8]

    Schism: A workload-driven approach to database replication and partitioning

    Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. Schism: A workload-driven approach to database replication and partitioning. Proc. VLDB Endow., 3(1-2):48–57, September 2010

  9. [9]

    Mirrokni

    Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG ’04, pages 253–262, New York, NY, USA, 2004. ACM

  10. [10]

    Locality- sensitive hashing scheme based on dynamic collision counting

    Junhao Gan, Jianlin Feng, Qiong Fang, and Wilfred Ng. Locality- sensitive hashing scheme based on dynamic collision counting. In Proceedings of the 2012 ACM SIGMOD International Conference on Man- agement of Data , SIGMOD ’12, pages 541–552, New York, NY, USA,

  11. [11]

    Jagadish, Beng Chin Ooi, and Sheng Wang

    Jinyang Gao, H.V. Jagadish, Beng Chin Ooi, and Sheng Wang. Selec- tive hashing: Closing the gap between radius search and k-nn search. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , KDD ’15, pages 349–358, New York, NY, USA, 2015. ACM

  12. [12]

    Hankins and Jignesh M

    Richard A. Hankins and Jignesh M. Patel. Effect of node size on the performance of cache-conscious b+-trees. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems , SIGMETRICS ’03, pages 283–294, New York, NY, USA, 2003. ACM

  13. [13]

    Query-aware locality-sensitive hashing for approximate nearest neighbor search

    Qiang Huang, Jianlin Feng, Yikai Zhang, Qiong Fang, and Wilfred Ng. Query-aware locality-sensitive hashing for approximate nearest neighbor search. Proc. VLDB Endow., 9(1):1–12, September 2015

  14. [14]

    Approximate nearest neighbors: Towards removing the curse of dimensionality

    Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing , STOC ’98, pages 604–613, New York, NY, USA, 1998. ACM

  15. [15]

    Sk- lsh: An efficient index structure for approximate nearest neighbor search

    Yingfan Liu, Jiangtao Cui, Zi Huang, Hui Li, and Heng Tao Shen. Sk- lsh: An efficient index structure for approximate nearest neighbor search. Proc. VLDB Endow., 7(9):745–756, May 2014

  16. [16]

    Multi-probe lsh: Efficient indexing for high-dimensional similarity search

    Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe lsh: Efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pages 950–961. VLDB Endowment, 2007

  17. [17]

    Selçuk Candan

    Parth Nagarkar and K. Selçuk Candan. HCS: hierarchical cut selection for efficiently processing queries on data columns using hierarchical bitmap indices. In Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece, March 24-28, 2014., pages 271–282, 2014

  18. [18]

    Selçuk Candan, and Aneesha Bhat

    Parth Nagarkar, K. Selçuk Candan, and Aneesha Bhat. Compressed spatial hierarchical bitmap (cshb) indexes for efficiently processing spatial range query workloads. PVLDB, 8(12):1382–1393, 2015

  19. [19]

    Skew-aware auto- matic database partitioning in shared-nothing, parallel oltp systems

    Andrew Pavlo, Carlo Curino, and Stanley Zdonik. Skew-aware auto- matic database partitioning in shared-nothing, parallel oltp systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pages 61–72, New York, NY, USA,

  20. [20]

    Ashwin Kumar, and Amol Deshpande

    Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande. Sword: Scal- able workload-aware data placement for transactional workloads. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT ’13, pages 430–441, New York, NY, USA, 2013. ACM

  21. [21]

    Jun Rao and Kenneth A. Ross. Making b+- trees cache conscious in main memory. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data , SIGMOD ’00, pages 475–486, New York, NY, USA, 2000. ACM

  22. [22]

    16S rRNA metagenome clustering and diversity estimation using locality sensitive hashing

    Zeehasham Rasheed, Huzefa Rangwala, and Daniel Barbará. 16S rRNA metagenome clustering and diversity estimation using locality sensitive hashing. BMC Systems Biology, 7(Suppl 4):S11, oct 2013

  23. [23]

    Cache-sensitive skip list: Efficient range queries on modern cpus

    Stefan Sprenger, Steffen Zeuch, and Ulf Leser. Cache-sensitive skip list: Efficient range queries on modern cpus. In Spyros Blanas, Rajesh Bordawekar, Tirthankar Lahiri, Justin Levandoski, and Andrew Pavlo, editors, Data Management on New Hardware , pages 1–17, Cham, 2017. Springer International Publishing

  24. [24]

    Srs: Solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index

    Yifang Sun, Wei Wang, Jianbin Qin, Ying Zhang, and Xuemin Lin. Srs: Solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. Proc. VLDB Endow. , 8(1):1–12, September 2014

  25. [25]

    Streaming similarity search over one billion tweets using parallel locality-sensitive hashing

    Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proc. VLDB Endow., 6(14):1930–1941, Sep- tember 2013

  26. [26]

    Efficient and accurate nearest neighbor and closest pair search in high-dimensional space

    Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst., 35(3):20:1–20:46, July 2010

  27. [27]

    Kostas Tzoumas, Man Lung Yiu, and Christian S. Jensen. Workload- aware indexing of continuously moving objects. Proc. VLDB Endow., 2(1):1186–1197, August 2009

  28. [28]

    C. E. Yoon, O. OReilly, K. J. Bergen, and G. C. Beroza. Earthquake detection through computationally efficient similarity search. Science Advances, 1(11):e1501057–e1501057, dec 2015

  29. [29]

    An experimental comparison of cache-oblivious and cache- conscious programs

    Kamen Yotov, Tom Roeder, Keshav Pingali, John Gunnels, and Fred Gustavson. An experimental comparison of cache-oblivious and cache- conscious programs. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures , SPAA ’07, pages 93–104, New York, NY, USA, 2007. ACM

  30. [30]

    Tung, and Sai Wu

    Yuxin Zheng, Qi Guo, Anthony K.H. Tung, and Sai Wu. Lazylsh: Approximate nearest neighbor search for multiple distance functions with a single index. In Proceedings of the 2016 International Conference on Management of Data , SIGMOD ’16, pages 2023–2037, New York, NY, USA, 2016. ACM