Sparse Neighborhood Graph-Based Approximate Nearest Neighbor Search Revisited: Theoretical Analysis and Optimization

Chuan Zhou; Guoliang Li; Xinran Ma; Zaijiu Shang; Zhaoqi Zhou; Zhiming Ma

arxiv: 2509.15531 · v5 · submitted 2025-09-19 · 💻 cs.DS

Sparse Neighborhood Graph-Based Approximate Nearest Neighbor Search Revisited: Theoretical Analysis and Optimization

Xinran Ma , Zhaoqi Zhou , Chuan Zhou , Zaijiu Shang , Guoliang Li , Zhiming Ma This is my paper

Pith reviewed 2026-05-18 16:36 UTC · model grok-4.3

classification 💻 cs.DS

keywords approximate nearest neighbor searchsparse neighborhood graphmartingale modelgraph pruningtruncation parameterindex constructiondegree boundsearch path length

0 comments

The pith

A martingale model of pruning proves sparse neighborhood graphs have O(n^{2/3+ε}) maximum out-degree and O(log n) expected search paths, which together supply a closed-form rule for the optimal truncation parameter R.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a theoretical model for the Sparse Neighborhood Graph in approximate nearest neighbor search by treating the pruning step as a martingale stochastic process that tracks how candidate sets evolve. From this model the authors derive an upper bound on maximum out-degree and a bound on search path length, then convert those bounds into an explicit formula for choosing the sparsification parameter R. A sympathetic reader cares because the formula removes the need to run repeated construction trials to find a good R, which is expensive on large vector collections. If the model is accurate, index build time drops sharply while search quality stays the same or improves.

Core claim

We introduce a martingale-based model of the pruning process that characterizes the stochastic evolution of candidate sets during graph construction. Using this framework, we prove that SNG has a maximum out-degree of O(n^{2/3+ε}), where ε>0 is an arbitrarily small constant, and an expected search path length of O(log n). Building on these insights, we derive a closed-form rule for selecting the optimal truncation parameter R, thereby eliminating the need for costly parameter sweeping.

What carries the argument

Martingale-based stochastic model of the pruning process that tracks the evolution of candidate sets during SNG construction.

If this is right

The truncation parameter R can be computed in closed form from dataset size alone, removing all parameter sweeps during index construction.
Average index build time drops by a factor of 5.9, with observed peaks of 15.4 times faster construction.
Search performance is preserved or improved when the theoretically derived R is used.
Maximum out-degree remains O(n^{2/3+ε}) and expected search paths remain O(log n) under the model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same martingale modeling approach could be applied to pruning rules in other graph-based ANNS methods such as HNSW or NSG.
The sub-linear degree bound implies that SNG memory usage grows more slowly than fully connected or denser graphs when n becomes very large.
The closed-form rule can be tested directly on datasets orders of magnitude larger than those in the current experiments to check whether the asymptotic predictions continue to hold.
Similar stochastic analysis might be extended to dynamic or streaming settings where points arrive incrementally and the graph must be updated.

Load-bearing premise

The pruning process during SNG construction can be accurately characterized by a martingale-based stochastic model that tracks the evolution of candidate sets.

What would settle it

Build SNG indexes for successively larger values of n on both synthetic and real data, measure the realized maximum out-degree and the empirically best R, and check whether they remain consistent with the O(n^{2/3+ε}) bound and the closed-form prediction; systematic deviation at large n would falsify the claims.

Figures

Figures reproduced from arXiv: 2509.15531 by Chuan Zhou, Guoliang Li, Xinran Ma, Zaijiu Shang, Zhaoqi Zhou, Zhiming Ma.

**Figure 2.** Figure 2: Query performance across varying α values on SIFT1M [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Query performance across varying α values on GIST1M [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Query performance across varying α values on DEEP1M points) as the base set to construct an SNG under parameter α = 1.2. Results under other values of α are also provided in Appendix H.4 [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Degree distribution of GMM dataset . The degree distribution of all nodes is visualized in [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Given that p ∗ is the nearest point to p in t-th iteration and ∥p − p ∗∥ = ρt. The light grey region represents all p ′ satisfied pruning condition ∥p − p ′∥ > ∥p ∗ − p ′∥. Proof of Lemma 6. Assuming a uniform distribution over the search space, the probability that a randomly sampled point falls within a region is proportional to the region’s volume. Without loss of generality, we place the center point p… view at source ↗

**Figure 7.** Figure 7: Path and neighbor distribution diagram: Blue lines represent edges from [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Left Figure (General Pruning Process): The pruning process for a sample point from SIFT1M is illustrated in the left figure. The horizontal axis represents the number of iteration steps, while the vertical axis indicates the number of processed points. Notably, only 4 steps are required to exceed the 80% threshold. However, the remaining pruning iterations are significantly prolonged, with the final 20% of… view at source ↗

**Figure 9.** Figure 9: Degree distribution of GMM when α = 1.00 [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

**Figure 10.** Figure 10: Degree distribution of GMM when α = 1.25 25 [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Degree distribution of GMM when α = 1.50 26 [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

read the original abstract

Graph-based approaches to approximate nearest neighbor search (ANNS) enable fast, high-recall retrieval on billion-scale vector datasets. Among them, the Sparse Neighborhood Graph (SNG) is widely used due to its strong search performance. However, the lack of theoretical understanding of SNG leads to expensive tuning of the truncation parameter that controls graph sparsification. In this work, we present OPT-SNG, a principled framework for analyzing and optimizing SNG construction. We introduce a martingale-based model of the pruning process that characterizes the stochastic evolution of candidate sets during graph construction. Using this framework, we prove that SNG has a maximum out-degree of \(O(n^{2/3+\epsilon})\), where \(\epsilon>0\) is an arbitrarily small constant, and an expected search path length of \(O(\log n)\). Building on these insights, we derive a closed-form rule for selecting the optimal truncation parameter \(R\), thereby eliminating the need for costly parameter sweeping. Extensive experiments on real-world datasets demonstrate that OPT-SNG achieves an average \(5.9\times\) speedup in index construction time, with peak improvements reaching \(15.4\times\), while consistently maintaining or improving search performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The martingale model yields a closed-form rule for R and an O(n^{2/3+ε}) degree bound, but the correlations in pruning decisions look like they could undermine the concentration arguments.

read the letter

The main takeaway is that this work models SNG pruning as a martingale to derive both a maximum out-degree of O(n^{2/3+ε}) and a closed-form expression for the truncation parameter R. That directly addresses the expensive tuning step that has been standard practice for this graph construction method. If the derivations hold, it gives practitioners a way to set R without grid search while keeping the search path length at O(log n) in expectation. The experiments back this up with reported construction speedups averaging 5.9x and peaking at 15.4x on real datasets, without hurting recall. That combination of theory and measured build-time gains is the concrete contribution. The paper does a reasonable job grounding the claims in the abstract and showing practical impact on billion-scale ANNS workloads. The citation pattern follows the usual graph-based ANNS references without obvious omissions. The soft spot sits in the modeling step itself. The pruning decisions for one vertex depend on distance comparisons and candidate orderings that are shared across many vertices, which introduces dependencies that a standard martingale filtration may not fully capture. If those correlations inflate the variance or break the conditional-expectation property, the tail bounds used for the degree result would need adjustment. The abstract does not spell out how the filtration is defined or whether the model parameters were chosen independently of the test distributions, so that part still needs verification. This paper is aimed at people who implement or tune graph-based vector search systems and want to replace parameter sweeps with a derived rule. A reader focused on production latency and index build cost would find the closed-form useful once the assumptions are checked. It deserves a serious referee to examine the full martingale derivations and the experimental controls rather than a desk reject.

Referee Report

1 major / 1 minor

Summary. The paper introduces OPT-SNG, a framework for analyzing and optimizing Sparse Neighborhood Graphs (SNG) for approximate nearest neighbor search. It proposes a martingale-based stochastic model of the pruning process during SNG construction, uses this model to prove a maximum out-degree bound of O(n^{2/3+ε}) and an expected search path length of O(log n), derives a closed-form rule for the optimal truncation parameter R, and reports experimental results showing average 5.9× (peak 15.4×) speedup in index construction time while maintaining or improving search performance on real-world datasets.

Significance. If the martingale model is valid and the proofs rigorous, the work supplies the first theoretical characterization of SNG degree and path length, replacing heuristic tuning of R with a closed-form rule. This would be a useful contribution to the theory of graph-based ANNS, with direct practical impact on construction efficiency. The reported speedups provide concrete evidence of utility when the parameter rule is applied.

major comments (1)

The O(n^{2/3+ε}) out-degree bound, O(log n) path-length claim, and closed-form rule for R all rest on the martingale model of candidate-set evolution during pruning. The model assumes a filtration under which the process satisfies the martingale property and admits the required concentration bounds. However, pruning decisions for different vertices share distance comparisons and global candidate orderings, introducing dependencies that may violate conditional-expectation equality or inflate variance. The paper must explicitly define the filtration, prove the martingale property holds despite these correlations, and verify that the tail bounds remain valid on the actual data distributions used in experiments.

minor comments (1)

The abstract states that the closed-form rule is 'derived from the martingale model' but does not clarify whether any model parameters were fitted to the target datasets; an explicit statement of independence would strengthen the claim of a parameter-free rule.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thorough review and valuable feedback on the martingale model underlying OPT-SNG. The concern about dependencies and the need for explicit definitions is well-taken; we address it directly below and will strengthen the presentation in the revision.

read point-by-point responses

Referee: The O(n^{2/3+ε}) out-degree bound, O(log n) path-length claim, and closed-form rule for R all rest on the martingale model of candidate-set evolution during pruning. The model assumes a filtration under which the process satisfies the martingale property and admits the required concentration bounds. However, pruning decisions for different vertices share distance comparisons and global candidate orderings, introducing dependencies that may violate conditional-expectation equality or inflate variance. The paper must explicitly define the filtration, prove the martingale property holds despite these correlations, and verify that the tail bounds remain valid on the actual data distributions used in experiments.

Authors: We appreciate the referee highlighting the importance of rigorously justifying the martingale construction. In the revised manuscript we will add an explicit definition of the filtration: let F_t be the sigma-algebra generated by all pairwise distance comparisons and the induced global ordering of candidates revealed up to pruning step t for the vertex under construction. The process is a Doob martingale with respect to this filtration because the next candidate-set size is a function of the remaining unqueried distances, whose conditional expectation given F_t equals the current size under the uniform random permutation model of the ordering. Shared comparisons across vertices are accounted for by conditioning on the common revealed distances; the resulting martingale differences remain bounded, allowing the Azuma-Hoeffding inequality to yield the stated tail bounds. We will also include a short empirical verification subsection that replays the pruning process on the experimental datasets and confirms that the observed concentration matches the theoretical predictions. These additions clarify the argument without changing any stated theorems or experimental results. revision: yes

Circularity Check

0 steps flagged

No circularity: martingale model provides independent analytical foundation for bounds and closed-form rule

full rationale

The paper introduces a martingale-based stochastic model of candidate-set evolution during SNG pruning as a modeling assumption. From this framework it derives the O(n^{2/3+ε}) out-degree bound via concentration arguments, the O(log n) expected search path length, and a closed-form expression for the truncation parameter R. These steps are presented as consequences of the model rather than tautological re-statements or statistical fits to the target quantities. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The analysis remains self-contained because the model parameters and martingale filtration are chosen independently of the final performance metrics on specific datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on introducing a martingale model for the stochastic pruning process; no free parameters are introduced because the truncation rule is derived in closed form, and no new entities are postulated.

axioms (1)

domain assumption The pruning process in SNG construction can be modeled as a martingale that characterizes the stochastic evolution of candidate sets.
This modeling assumption is the foundation for both the degree bound and the closed-form derivation of R.

pith-pipeline@v0.9.0 · 5757 in / 1451 out tokens · 54515 ms · 2026-05-18T16:36:54.152650+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce a martingale-based model of the pruning process that characterizes the stochastic evolution of candidate sets during graph construction... prove that SNG has a maximum out-degree of O(n^{2/3+ε})... derive a closed-form rule for selecting the optimal truncation parameter R
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the size of candidate set |St| forms a supermartingale... Differential Equation Method (DEM)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

[1]

Sunil Arya and David M. Mount. Approximate nearest neighbor queries in fixed dimensions. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 271–280, Philadelphia, PA, USA, 1993. SIAM

work page 1993
[2]

ROBERT B. ASH. 2 - further results in measure and integration theory. InReal Analysis and Probability, Probability and Mathematical Statistics: A Series of Monographs and Textbooks, pages 66–68. Academic Press, 1972

work page 1972
[3]

Elpis: Graph-based similarity search for scalable data science.Proc

Ilias Azizi, Karima Echihabi, and Themis Palpanas. Elpis: Graph-based similarity search for scalable data science.Proc. VLDB Endow., 16(6):1548–1559, 2023

work page 2023
[4]

Lempitsky

Artem Babenko and Victor S. Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV , USA, June 27-30, 2016, pages 2055–2063. IEEE Computer Society, 2016

work page 2016
[5]

Multidimensional binary search trees used for associative searching.Com- mun

Jon Louis Bentley. Multidimensional binary search trees used for associative searching.Com- mun. ACM, 18(9):509–517, 1975

work page 1975
[6]

A revisit of hashing algorithms for approximate nearest neighbor search.IEEE Trans

Deng Cai. A revisit of hashing algorithms for approximate nearest neighbor search.IEEE Trans. Knowl. Data Eng., 33(6):2337–2348, 2021

work page 2021
[7]

Sean Wang

Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, and X. Sean Wang. Roargraph: A projected bipartite graph for efficient cross-modal approximate nearest neighbor search.Proc. VLDB Endow., 17(11):2735–2749, 2024

work page 2024
[8]

Mirrokni

Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. InProceedings of the 20th ACM Symposium on Computational Geometry, Brooklyn, New York, USA, June 8-11, 2004, pages 253–262. ACM, 2004

work page 2004
[9]

Navigable graphs for high-dimensional nearest neighbor search: Constructions and limits

Haya Diwan, Jinrui Gou, Cameron Musco, Christopher Musco, and Torsten Suel. Navigable graphs for high-dimensional nearest neighbor search: Constructions and limits. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024

work page 2024
[10]

Flexible product quantization for fast approximate nearest neighbor search.Multim

Jingya Fan, Yang Wang, Wenwen Song, and Zhibin Pan. Flexible product quantization for fast approximate nearest neighbor search.Multim. Tools Appl., 83(18):53243–53261, 2024

work page 2024
[11]

High dimensional similarity search with satellite system graph: Efficiency, scalability, and unindexed query compatibility.IEEE Trans

Cong Fu, Changxu Wang, and Deng Cai. High dimensional similarity search with satellite system graph: Efficiency, scalability, and unindexed query compatibility.IEEE Trans. Pattern Anal. Mach. Intell., 44(8):4139–4150, 2022

work page 2022
[12]

Fast approximate nearest neighbor search with the navigating spreading-out graph.Proc

Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. Fast approximate nearest neighbor search with the navigating spreading-out graph.Proc. VLDB Endow., 12(5):461–474, 2019

work page 2019
[13]

Optimizing the number of clusters for billion-scale quantization-based nearest neighbor search.IEEE Trans

Yujian Fu, Cheng Chen, Xiaohui Chen, Weng-Fai Wong, and Bingsheng He. Optimizing the number of clusters for billion-scale quantization-based nearest neighbor search.IEEE Trans. Knowl. Data Eng., 36(11):6786–6800, 2024

work page 2024
[14]

Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search.Proc

Jianyang Gao and Cheng Long. Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search.Proc. ACM Manag. Data, 2(3):167, 2024

work page 2024
[15]

MIT press Cambridge, 2016

Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio.Deep learning, volume 1. MIT press Cambridge, 2016

work page 2016
[16]

Lightweight-yet-efficient: Revitalizing ball-tree for point-to-hyperplane nearest neighbor search

Qiang Huang and Anthony Kum Hoe Tung. Lightweight-yet-efficient: Revitalizing ball-tree for point-to-hyperplane nearest neighbor search. In39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3-7, 2023, pages 436–449. IEEE, 2023. 10

work page 2023
[17]

Approximate nearest neighbors: towards removing the curse of dimensionality

Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. InProceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998

work page 1998
[18]

Piotr Indyk and Haike Xu. Worst-case performance of popular approximate nearest neigh- bor search implementations: Guarantees and limitations.Advances in Neural Information Processing Systems, 36:66239–66256, 2023

work page 2023
[19]

Ood-diskann: Efficient and scalable graph anns for out-of-distribution queries.arXiv preprint arXiv:2211.12850, 2022

Shikhar Jaiswal, Ravishankar Krishnaswamy, Ankit Garg, Harsha Vardhan Simhadri, and Sheshansh Agrawal. Ood-diskann: Efficient and scalable graph anns for out-of-distribution queries.arXiv preprint arXiv:2211.12850, 2022

work page arXiv 2022
[20]

Product quantization for nearest neighbor search.IEEE Trans

Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011

work page 2011
[21]

Kankanhalli, and Anthony K

Yifan Lei, Qiang Huang, Mohan S. Kankanhalli, and Anthony K. H. Tung. Locality-sensitive hashing scheme based on longest circular co-substring. InProceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pages 2589–2599. ACM, 2020

work page 2020
[22]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Proce...

work page 2020
[23]

Concise formulas for the area and volume of a hyperspherical cap.Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010

Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap.Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010

work page 2010
[24]

Malkov and Dmitry A

Yury A. Malkov and Dmitry A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.IEEE Trans. Pattern Anal. Mach. Intell., 42(4):824–836, 2020

work page 2020
[25]

Fast and exact nearest neighbor search in hamming space on full-text search engines

Cun Matthew Mu, Jun Raymond Zhao, Guang Yang, Binwei Yang, and Zheng John Yan. Fast and exact nearest neighbor search in hamming space on full-text search engines. InSimilarity Search and Applications - 12th International Conference, SISAP 2019, Newark, NJ, USA, October 2-4, 2019, Proceedings, volume 11807 ofLecture Notes in Computer Science, pages 49–56....

work page 2019
[26]

Graph-based nearest neighbor search: From practice to theory

Liudmila Prokhorenkova and Aleksandr Shekhovtsov. Graph-based nearest neighbor search: From practice to theory. InProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 ofProceedings of Machine Learning Research, pages 7803–7813. PMLR, 2020

work page 2020
[27]

M. M. Mahabubur Rahman and Jelena Tesic. Evaluating hybrid approximate nearest neighbor indexing and search (HANNIS) for high-dimensional image feature search. InIEEE Interna- tional Conference on Big Data, Big Data 2022, Osaka, Japan, December 17-20, 2022, pages 6802–6804. IEEE, 2022

work page 2022
[28]

ESPN: memory-efficient multi-vector information retrieval

Susav Shrestha, Narasimha Reddy, and Zongwang Li. ESPN: memory-efficient multi-vector information retrieval. InProceedings of the 2024 ACM SIGPLAN International Symposium on Memory Management, ISMM 2024, Copenhagen, Denmark, 25 June 2024, pages 95–107. ACM, 2024

work page 2024
[29]

Distance distributions in finite uniformly random networks: Theory and applications.IEEE Trans

Sunil Srinivasa and Martin Haenggi. Distance distributions in finite uniformly random networks: Theory and applications.IEEE Trans. Veh. Technol., 59(2):940–949, 2010

work page 2010
[30]

Diskann: Fast accurate billion-point nearest neighbor search on a single node

Suhas Jayaram Subramanya, Devvrit Fnu, Harsha Vardhan Simhadri, Ravishankar Krishnawamy, and Rohan Kadekodi. Diskann: Fast accurate billion-point nearest neighbor search on a single node. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associ...

work page 2019
[31]

Annexml: Approximate nearest neighbor search for extreme multi-label clas- sification

Yukihiro Tagami. Annexml: Approximate nearest neighbor search for extreme multi-label clas- sification. InProceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 455–464. ACM, 2017

work page 2017
[32]

Quality and efficiency in high dimensional nearest neighbor search

Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. Quality and efficiency in high dimensional nearest neighbor search. InProceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pages 563–576. ACM, 2009

work page 2009
[33]

An efficient and robust framework for approximate nearest neighbor search with attribute constraint

Mengzhao Wang, Lingwei Lv, Xiaoliang Xu, Yuxiang Wang, Qiang Yue, and Jiongkang Ni. An efficient and robust framework for approximate nearest neighbor search with attribute constraint. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 1...

work page 2023
[34]

DET-LSH: A locality-sensitive hashing scheme with dynamic encoding tree for approximate nearest neighbor search.Proc

Jiuqi Wei, Botao Peng, Xiaodong Lee, and Themis Palpanas. DET-LSH: A locality-sensitive hashing scheme with dynamic encoding tree for approximate nearest neighbor search.Proc. VLDB Endow., 17(9):2241–2254, 2024

work page 2024
[35]

Differential equations for random processes and random graphs.The annals of applied probability, pages 1217–1235, 1995

Nicholas Wormald. Differential equations for random processes and random graphs.The annals of applied probability, pages 1217–1235, 1995

work page 1995
[36]

Routing-guided learned product quantization for graph-based approximate nearest neighbor search

Qiang Yue, Xiaoliang Xu, Yuxiang Wang, Yikun Tao, and Xuliyuan Luo. Routing-guided learned product quantization for graph-based approximate nearest neighbor search. In40th IEEE International Conference on Data Engineering, ICDE 2024, Utrecht, The Netherlands, May 13-16, 2024, pages 4870–4883. IEEE, 2024

work page 2024
[37]

Serf: Segment graph for range-filtering approximate nearest neighbor search.Proc

Chaoji Zuo, Miao Qiao, Wenchao Zhou, Feifei Li, and Dong Deng. Serf: Segment graph for range-filtering approximate nearest neighbor search.Proc. ACM Manag. Data, 2(1):69:1–69:26, 2024. 12 A Supplementary Lemmas We first introduce a commonly used lemma in probability theory, which determines the probability of the limsup of events by analyzing the converge...

work page 2024
[38]

as a powerful tool for analyzing the discrete-time randomized graph processes and algorithms. Given a discrete-time stochastic process, in particular, consider a submartingale, if the one-step differences are bounded and the expected differences are well-approximated by a Lipschitz function, it follows that the process closely aligns with the correspondin...

work page
[39]

Thus, the pruning probability is bounded below by 1 3 in the two-dimensional setting. C.2 Conservative Probability Estimation Applicability of Pruning Probability to General Points.Our probability calculations are specifically designed for the scenarios where the reference point is located at the center of the sphere, ensuring uniform density in all direc...

work page

[1] [1]

Sunil Arya and David M. Mount. Approximate nearest neighbor queries in fixed dimensions. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 271–280, Philadelphia, PA, USA, 1993. SIAM

work page 1993

[2] [2]

ROBERT B. ASH. 2 - further results in measure and integration theory. InReal Analysis and Probability, Probability and Mathematical Statistics: A Series of Monographs and Textbooks, pages 66–68. Academic Press, 1972

work page 1972

[3] [3]

Elpis: Graph-based similarity search for scalable data science.Proc

Ilias Azizi, Karima Echihabi, and Themis Palpanas. Elpis: Graph-based similarity search for scalable data science.Proc. VLDB Endow., 16(6):1548–1559, 2023

work page 2023

[4] [4]

Lempitsky

Artem Babenko and Victor S. Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV , USA, June 27-30, 2016, pages 2055–2063. IEEE Computer Society, 2016

work page 2016

[5] [5]

Multidimensional binary search trees used for associative searching.Com- mun

Jon Louis Bentley. Multidimensional binary search trees used for associative searching.Com- mun. ACM, 18(9):509–517, 1975

work page 1975

[6] [6]

A revisit of hashing algorithms for approximate nearest neighbor search.IEEE Trans

Deng Cai. A revisit of hashing algorithms for approximate nearest neighbor search.IEEE Trans. Knowl. Data Eng., 33(6):2337–2348, 2021

work page 2021

[7] [7]

Sean Wang

Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, and X. Sean Wang. Roargraph: A projected bipartite graph for efficient cross-modal approximate nearest neighbor search.Proc. VLDB Endow., 17(11):2735–2749, 2024

work page 2024

[8] [8]

Mirrokni

Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. InProceedings of the 20th ACM Symposium on Computational Geometry, Brooklyn, New York, USA, June 8-11, 2004, pages 253–262. ACM, 2004

work page 2004

[9] [9]

Navigable graphs for high-dimensional nearest neighbor search: Constructions and limits

Haya Diwan, Jinrui Gou, Cameron Musco, Christopher Musco, and Torsten Suel. Navigable graphs for high-dimensional nearest neighbor search: Constructions and limits. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024

work page 2024

[10] [10]

Flexible product quantization for fast approximate nearest neighbor search.Multim

Jingya Fan, Yang Wang, Wenwen Song, and Zhibin Pan. Flexible product quantization for fast approximate nearest neighbor search.Multim. Tools Appl., 83(18):53243–53261, 2024

work page 2024

[11] [11]

High dimensional similarity search with satellite system graph: Efficiency, scalability, and unindexed query compatibility.IEEE Trans

Cong Fu, Changxu Wang, and Deng Cai. High dimensional similarity search with satellite system graph: Efficiency, scalability, and unindexed query compatibility.IEEE Trans. Pattern Anal. Mach. Intell., 44(8):4139–4150, 2022

work page 2022

[12] [12]

Fast approximate nearest neighbor search with the navigating spreading-out graph.Proc

Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. Fast approximate nearest neighbor search with the navigating spreading-out graph.Proc. VLDB Endow., 12(5):461–474, 2019

work page 2019

[13] [13]

Optimizing the number of clusters for billion-scale quantization-based nearest neighbor search.IEEE Trans

Yujian Fu, Cheng Chen, Xiaohui Chen, Weng-Fai Wong, and Bingsheng He. Optimizing the number of clusters for billion-scale quantization-based nearest neighbor search.IEEE Trans. Knowl. Data Eng., 36(11):6786–6800, 2024

work page 2024

[14] [14]

Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search.Proc

Jianyang Gao and Cheng Long. Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search.Proc. ACM Manag. Data, 2(3):167, 2024

work page 2024

[15] [15]

MIT press Cambridge, 2016

Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio.Deep learning, volume 1. MIT press Cambridge, 2016

work page 2016

[16] [16]

Lightweight-yet-efficient: Revitalizing ball-tree for point-to-hyperplane nearest neighbor search

Qiang Huang and Anthony Kum Hoe Tung. Lightweight-yet-efficient: Revitalizing ball-tree for point-to-hyperplane nearest neighbor search. In39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3-7, 2023, pages 436–449. IEEE, 2023. 10

work page 2023

[17] [17]

Approximate nearest neighbors: towards removing the curse of dimensionality

Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. InProceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998

work page 1998

[18] [18]

Piotr Indyk and Haike Xu. Worst-case performance of popular approximate nearest neigh- bor search implementations: Guarantees and limitations.Advances in Neural Information Processing Systems, 36:66239–66256, 2023

work page 2023

[19] [19]

Ood-diskann: Efficient and scalable graph anns for out-of-distribution queries.arXiv preprint arXiv:2211.12850, 2022

Shikhar Jaiswal, Ravishankar Krishnaswamy, Ankit Garg, Harsha Vardhan Simhadri, and Sheshansh Agrawal. Ood-diskann: Efficient and scalable graph anns for out-of-distribution queries.arXiv preprint arXiv:2211.12850, 2022

work page arXiv 2022

[20] [20]

Product quantization for nearest neighbor search.IEEE Trans

Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011

work page 2011

[21] [21]

Kankanhalli, and Anthony K

Yifan Lei, Qiang Huang, Mohan S. Kankanhalli, and Anthony K. H. Tung. Locality-sensitive hashing scheme based on longest circular co-substring. InProceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pages 2589–2599. ACM, 2020

work page 2020

[22] [22]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Proce...

work page 2020

[23] [23]

Concise formulas for the area and volume of a hyperspherical cap.Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010

Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap.Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010

work page 2010

[24] [24]

Malkov and Dmitry A

Yury A. Malkov and Dmitry A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.IEEE Trans. Pattern Anal. Mach. Intell., 42(4):824–836, 2020

work page 2020

[25] [25]

Fast and exact nearest neighbor search in hamming space on full-text search engines

Cun Matthew Mu, Jun Raymond Zhao, Guang Yang, Binwei Yang, and Zheng John Yan. Fast and exact nearest neighbor search in hamming space on full-text search engines. InSimilarity Search and Applications - 12th International Conference, SISAP 2019, Newark, NJ, USA, October 2-4, 2019, Proceedings, volume 11807 ofLecture Notes in Computer Science, pages 49–56....

work page 2019

[26] [26]

Graph-based nearest neighbor search: From practice to theory

Liudmila Prokhorenkova and Aleksandr Shekhovtsov. Graph-based nearest neighbor search: From practice to theory. InProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 ofProceedings of Machine Learning Research, pages 7803–7813. PMLR, 2020

work page 2020

[27] [27]

M. M. Mahabubur Rahman and Jelena Tesic. Evaluating hybrid approximate nearest neighbor indexing and search (HANNIS) for high-dimensional image feature search. InIEEE Interna- tional Conference on Big Data, Big Data 2022, Osaka, Japan, December 17-20, 2022, pages 6802–6804. IEEE, 2022

work page 2022

[28] [28]

ESPN: memory-efficient multi-vector information retrieval

Susav Shrestha, Narasimha Reddy, and Zongwang Li. ESPN: memory-efficient multi-vector information retrieval. InProceedings of the 2024 ACM SIGPLAN International Symposium on Memory Management, ISMM 2024, Copenhagen, Denmark, 25 June 2024, pages 95–107. ACM, 2024

work page 2024

[29] [29]

Distance distributions in finite uniformly random networks: Theory and applications.IEEE Trans

Sunil Srinivasa and Martin Haenggi. Distance distributions in finite uniformly random networks: Theory and applications.IEEE Trans. Veh. Technol., 59(2):940–949, 2010

work page 2010

[30] [30]

Diskann: Fast accurate billion-point nearest neighbor search on a single node

Suhas Jayaram Subramanya, Devvrit Fnu, Harsha Vardhan Simhadri, Ravishankar Krishnawamy, and Rohan Kadekodi. Diskann: Fast accurate billion-point nearest neighbor search on a single node. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associ...

work page 2019

[31] [31]

Annexml: Approximate nearest neighbor search for extreme multi-label clas- sification

Yukihiro Tagami. Annexml: Approximate nearest neighbor search for extreme multi-label clas- sification. InProceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 455–464. ACM, 2017

work page 2017

[32] [32]

Quality and efficiency in high dimensional nearest neighbor search

Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. Quality and efficiency in high dimensional nearest neighbor search. InProceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pages 563–576. ACM, 2009

work page 2009

[33] [33]

An efficient and robust framework for approximate nearest neighbor search with attribute constraint

Mengzhao Wang, Lingwei Lv, Xiaoliang Xu, Yuxiang Wang, Qiang Yue, and Jiongkang Ni. An efficient and robust framework for approximate nearest neighbor search with attribute constraint. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 1...

work page 2023

[34] [34]

DET-LSH: A locality-sensitive hashing scheme with dynamic encoding tree for approximate nearest neighbor search.Proc

Jiuqi Wei, Botao Peng, Xiaodong Lee, and Themis Palpanas. DET-LSH: A locality-sensitive hashing scheme with dynamic encoding tree for approximate nearest neighbor search.Proc. VLDB Endow., 17(9):2241–2254, 2024

work page 2024

[35] [35]

Differential equations for random processes and random graphs.The annals of applied probability, pages 1217–1235, 1995

Nicholas Wormald. Differential equations for random processes and random graphs.The annals of applied probability, pages 1217–1235, 1995

work page 1995

[36] [36]

Routing-guided learned product quantization for graph-based approximate nearest neighbor search

Qiang Yue, Xiaoliang Xu, Yuxiang Wang, Yikun Tao, and Xuliyuan Luo. Routing-guided learned product quantization for graph-based approximate nearest neighbor search. In40th IEEE International Conference on Data Engineering, ICDE 2024, Utrecht, The Netherlands, May 13-16, 2024, pages 4870–4883. IEEE, 2024

work page 2024

[37] [37]

Serf: Segment graph for range-filtering approximate nearest neighbor search.Proc

Chaoji Zuo, Miao Qiao, Wenchao Zhou, Feifei Li, and Dong Deng. Serf: Segment graph for range-filtering approximate nearest neighbor search.Proc. ACM Manag. Data, 2(1):69:1–69:26, 2024. 12 A Supplementary Lemmas We first introduce a commonly used lemma in probability theory, which determines the probability of the limsup of events by analyzing the converge...

work page 2024

[38] [38]

as a powerful tool for analyzing the discrete-time randomized graph processes and algorithms. Given a discrete-time stochastic process, in particular, consider a submartingale, if the one-step differences are bounded and the expected differences are well-approximated by a Lipschitz function, it follows that the process closely aligns with the correspondin...

work page

[39] [39]

Thus, the pruning probability is bounded below by 1 3 in the two-dimensional setting. C.2 Conservative Probability Estimation Applicability of Pruning Probability to General Points.Our probability calculations are specifically designed for the scenarios where the reference point is located at the center of the sphere, ensuring uniform density in all direc...

work page