The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search
Pith reviewed 2026-05-24 20:16 UTC · model grok-4.3
The pith
Local intrinsic dimensionality selects query sets spanning wide difficulty ranges in real-world nearest neighbor search benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that local intrinsic dimensionality distributions allow curation of query sets across a broad difficulty spectrum on real data, that different such distributions produce measurable differences in search running times, and that common benchmark datasets exhibit low diversity because performance outcomes transfer reliably between them.
What carries the argument
Local intrinsic dimensionality (LID) as a scalar measure of local data complexity around each query point, used both to select queries and to visualize performance behavior.
If this is right
- Different LID distributions of queries produce observable differences in running-time performance across nearest neighbor search implementations.
- Visualization methods based on LID give a finer view of how search algorithms behave on individual queries.
- Results obtained on any single real-world dataset transfer well to other such datasets.
Where Pith is reading between the lines
- Benchmark suites could adopt LID-based query selection as a standard way to ensure coverage of easy and hard cases.
- The observed lack of diversity suggests that adding more similar real datasets may not increase the range of tested conditions.
- Synthetic data generators tuned to produce specific LID profiles might complement existing benchmarks for more targeted testing.
Load-bearing premise
Local intrinsic dimensionality accurately reflects the computational difficulty of nearest neighbor queries.
What would settle it
Finding no consistent difference in search running times when query sets are deliberately chosen from different LID ranges across multiple datasets and algorithms.
Figures
read the original abstract
This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) allows to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that allow to get a more fine-grained overview of the inner workings of nearest neighbor search principles. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that local intrinsic dimensionality (LID) enables selection of query sets spanning a wide range of difficulties for nearest neighbor search on real-world datasets; that varying LID distributions affect the running-time performance of NN implementations (studied via new visualizations of algorithm internals); and that commonly used real-world datasets lack diversity because performance results on any single dataset predict results on the others well.
Significance. If the empirical claims hold after proper controls and documentation, the work could improve NN-search benchmarking by supplying a controllable proxy (LID) for query difficulty and by cautioning the community against treating a small collection of datasets as representative. The introduced visualizations might also aid diagnosis of algorithm behavior beyond aggregate recall or time metrics.
major comments (3)
- [Abstract / query-set selection section] Abstract and the section describing the LID-based query-set construction: the central claim that LID distributions permit selection of query sets with a wide difficulty range (and that these distributions affect running times) requires evidence that LID is the dominant driver. No controls or comparisons are described against other per-query factors such as the empirical distribution of distances to the k nearest neighbors, local density, or cluster structure; without such controls the proxy assumption remains untested.
- [Closing remarks on dataset diversity] The diversity claim (results on a single dataset predict results on all others): the manuscript must specify the exact datasets, the performance metrics used for the prediction test, the quantitative measure of predictive power (e.g., correlation or rank agreement), and any statistical significance assessment. The current statement is too coarse to evaluate whether the datasets truly share unmeasured properties that LID does not isolate.
- [Empirical study section] Empirical study of LID effects on running time: the description supplies no information on the number of runs, variance estimation, hardware controls, or error bars, making it impossible to judge whether observed differences are attributable to LID rather than implementation or measurement artifacts.
minor comments (2)
- [Introduction] Notation for LID and the precise definition of “difficulty” should be introduced earlier and used consistently.
- [Visualization section] Figure captions for the new visualization concepts should explicitly state what each panel shows and how it relates to the LID-controlled query sets.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments identify areas where additional rigor and detail will strengthen the manuscript. We respond to each major comment below and indicate the planned revisions.
read point-by-point responses
-
Referee: [Abstract / query-set selection section] Abstract and the section describing the LID-based query-set construction: the central claim that LID distributions permit selection of query sets with a wide difficulty range (and that these distributions affect running times) requires evidence that LID is the dominant driver. No controls or comparisons are described against other per-query factors such as the empirical distribution of distances to the k nearest neighbors, local density, or cluster structure; without such controls the proxy assumption remains untested.
Authors: We acknowledge that the manuscript does not include explicit controls comparing LID against other per-query factors. The central contribution is the introduction of LID as a practical, controllable proxy together with visualizations that illustrate its effect on search behavior. We will revise the relevant sections to discuss potential confounding factors (distance distributions, local density, cluster structure) and to state explicitly that the current evidence is correlational rather than a controlled isolation of LID. This will be a textual clarification and limitation statement; no new experiments are planned at this stage. revision: partial
-
Referee: [Closing remarks on dataset diversity] The diversity claim (results on a single dataset predict results on all others): the manuscript must specify the exact datasets, the performance metrics used for the prediction test, the quantitative measure of predictive power (e.g., correlation or rank agreement), and any statistical significance assessment. The current statement is too coarse to evaluate whether the datasets truly share unmeasured properties that LID does not isolate.
Authors: We agree the claim is stated too coarsely. The experiments use the standard ANN-Benchmarks collection (SIFT, GloVe, MNIST, NYTimes, and similar). Performance is measured by recall@k and query time; predictive power is assessed via rank correlation of algorithm orderings across datasets. We will add the exact dataset list, the precise metrics, the correlation values obtained, and a brief note on statistical significance. These additions will make the diversity observation precise and reproducible. revision: yes
-
Referee: [Empirical study section] Empirical study of LID effects on running time: the description supplies no information on the number of runs, variance estimation, hardware controls, or error bars, making it impossible to judge whether observed differences are attributable to LID rather than implementation or measurement artifacts.
Authors: We accept this criticism on reproducibility. The experimental section will be expanded to report the number of independent runs per configuration, the hardware platform (CPU model, memory), the method used for variance estimation, and error bars on the time and recall plots. These details were present in our internal logs but omitted from the submitted text. revision: yes
Circularity Check
No circularity; claims are empirical observations from data
full rationale
The paper's core results—that LID distributions can be used to select query sets spanning difficulty ranges and that results on one real-world dataset predict others—are presented as direct empirical findings from measurements and visualizations on multiple datasets. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the abstract or described claims; the derivation chain consists of data-driven comparisons rather than reductions to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Alman, J., Williams, R.: Probabilistic polynomials and hamming nearest neighbors. In: FOCS’15. pp. 136–150
-
[2]
Amsaleg, L., Chelly, O., Furon, T., Girard, S., Houle, M.E., Kawarabayashi, K.I., Nett, M.: Estimating local intrinsic dimensionality. In: KDD’15. pp. 29–38. ACM (2015)
work page 2015
-
[3]
In: Proceedings of the 2019 SIAM International Conference on Data Mining
Amsaleg, L., Chelly, O., Houle, M.E., Kawarabayashi, K.i., Radovanovi´ c, M., Treeratanajaru, W.: Intrinsic dimensionality estimation within tight localities. In: Proceedings of the 2019 SIAM International Conference on Data Mining. pp. 181–189. SIAM (2019)
work page 2019
-
[4]
5 Mayank Bawa, Tyson Condie, and Prasanna Ganesan
Aum¨ uller, M., Bernhardsson, E., Faithfull, A.: ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. In: SISAP’17. pp. 34–49 (2017). https://doi.org/10.1007/978-3-319-68474-1 3
-
[5]
Aum¨ uller, M., Ceccarello, M.: Benchmarking nearest neighbor search: Influence of local intrinsic dimensionality and result diversity in real-world datasets. In: 1st Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning (EDML 2019) (2019), https://imada.sdu.dk/Research/EDML/
work page 2019
-
[6]
Bernhardsson, E.: Annoy, https://github.com/spotify/annoy
-
[7]
Casanova, G., Englmeier, E., Houle, M.E., Kr¨ oger, P., Nett, M., Schubert, E., Zimek, A.: Dimensional testing for reverse k-nearest neighbor search. PVLDB 10(7), 769–780 (2017)
work page 2017
-
[8]
Ch´ avez, E., Navarro, G., Baeza-Yates, R., Marroqu´ ın, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (Sep 2001). https://doi.org/10.1145/502807.502808 14 Martin Aum¨ uller and Matteo Ceccarello
-
[9]
Curtin, R.R., Cline, J.R., Slagle, N.P., March, W.B., Ram, P., Mehta, N.A., Gray, A.G.: MLPACK: A scalable C++ machine learning library. J. of Machine Learning Research 14, 801–805 (2013)
work page 2013
-
[10]
In: NIPS 2014 Workshop on Software Engineering for Machine Learning (2014)
Edel, M., Soni, A., Curtin, R.R.: An automatic benchmarking system. In: NIPS 2014 Workshop on Software Engineering for Machine Learning (2014)
work page 2014
-
[11]
In: Data Mining Workshops (ICDMW)
Houle, M.E.: Dimensionality, discriminability, density and distance distributions. In: Data Mining Workshops (ICDMW). pp. 468–473. IEEE (2013)
work page 2013
-
[12]
Houle, M.E., Schubert, E., Zimek, A.: On the correlation between local intrinsic dimensionality and outlierness. In: SISAP’18. pp. 177–191 (2018). https://doi.org/10.1007/978-3-030-02224-2 14
-
[13]
Iwasaki, M., Miyazaki, D.: Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data. ArXiv e-prints (Oct 2018)
work page 2018
-
[14]
J´ egou, H., Douze, M., Schmid, C.: Product quantization for nearest neigh- bor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011). https://doi.org/10.1109/TPAMI.2010.57
-
[15]
Billion-scale similarity search with GPUs
Johnson, J., Douze, M., J´ egou, H.: Billion-scale similarity search with gpus. CoRR abs/1702.08734 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[16]
Israel Journal of Mathematics 54(2), 129–138 (1986)
Johnson, W.B., Lindenstrauss, J., Schechtman, G.: Extensions of Lipschitz maps into Banach spaces. Israel Journal of Mathematics 54(2), 129–138 (1986)
work page 1986
- [17]
-
[18]
Kriegel, H., Schubert, E., Zimek, A.: The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowl. Inf. Syst. 52(2), 341–378 (2017)
work page 2017
-
[19]
Levina, E., Bickel, P.J.: Maximum likelihood estimation of intrinsic dimension. In: NIPS’15. pp. 777–784 (2005)
work page 2005
-
[20]
Li, W., Zhang, Y., Sun, Y., Wang, W., Zhang, W., Lin, X.: Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement (v1.0). CoRR abs/1610.02455 (2016)
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[21]
Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. ArXiv e-prints (Mar 2016)
work page 2016
-
[22]
Efficient Estimation of Word Representations in Vector Space
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre- sentations in vector space. CoRR abs/1301.3781 (2013)
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[23]
In: Empirical Methods in Natural Language Processing (EMNLP)
Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word repre- sentation. In: Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
work page 2014
-
[24]
Computers & Operations Research 45, 12 – 24 (2014)
Smith-Miles, K., Baatar, D., Wreford, B., Lewis, R.: Towards objective measures of algorithm performance across instance space. Computers & Operations Research 45, 12 – 24 (2014)
work page 2014
-
[25]
Spring, R., Shrivastava, A.: Scalable and sustainable deep learning via randomized hashing. In: KDD’17. pp. 445–454 (2017). https://doi.org/10.1145/3097983.3098035
-
[26]
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms
Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for bench- marking machine learning algorithms. CoRR abs/1708.07747 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.