pith. sign in

arxiv: 1907.07387 · v1 · pith:6RAADYBVnew · submitted 2019-07-17 · 💻 cs.IR · cs.DB

The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search

Pith reviewed 2026-05-24 20:16 UTC · model grok-4.3

classification 💻 cs.IR cs.DB
keywords nearest neighbor searchlocal intrinsic dimensionalitybenchmarkingquery difficultydataset diversityperformance evaluationquery sets
0
0 comments X

The pith

Local intrinsic dimensionality selects query sets spanning wide difficulty ranges in real-world nearest neighbor search benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that local intrinsic dimensionality can be used to pick queries of many different difficulty levels from the same real datasets. This makes it possible to study how nearest neighbor search running times change with query hardness in a controlled way. The work also finds that standard real-world datasets are not diverse, because running-time results on any one of them predict results on the others quite well. A reader cares because current benchmarks may give an incomplete picture of how algorithms behave across varying conditions.

Core claim

The authors establish that local intrinsic dimensionality distributions allow curation of query sets across a broad difficulty spectrum on real data, that different such distributions produce measurable differences in search running times, and that common benchmark datasets exhibit low diversity because performance outcomes transfer reliably between them.

What carries the argument

Local intrinsic dimensionality (LID) as a scalar measure of local data complexity around each query point, used both to select queries and to visualize performance behavior.

If this is right

  • Different LID distributions of queries produce observable differences in running-time performance across nearest neighbor search implementations.
  • Visualization methods based on LID give a finer view of how search algorithms behave on individual queries.
  • Results obtained on any single real-world dataset transfer well to other such datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark suites could adopt LID-based query selection as a standard way to ensure coverage of easy and hard cases.
  • The observed lack of diversity suggests that adding more similar real datasets may not increase the range of tested conditions.
  • Synthetic data generators tuned to produce specific LID profiles might complement existing benchmarks for more targeted testing.

Load-bearing premise

Local intrinsic dimensionality accurately reflects the computational difficulty of nearest neighbor queries.

What would settle it

Finding no consistent difference in search running times when query sets are deliberately chosen from different LID ranges across multiple datasets and algorithms.

Figures

Figures reproduced from arXiv: 1907.07387 by Martin Aum\"uller, Matteo Ceccarello.

Figure 1
Figure 1. Figure 1: LID distribution for each dataset. Ticks below the distribu￾tion curves represent single queries. Lines within each distribution curve correspond to the 25, 50 and 75 percentile. The red line marks the 10 000 largest estimated LID, which we use as a threshold value to define hard query sets. nearest neighbor search algorithms. Fashion-MNIST is considered as a replacement for MNIST, which is usually conside… view at source ↗
Figure 2
Figure 2. Figure 2: Recall-QPS (1/s) tradeoff – up and to the right is better; top: SIFT, bottom: GLOVE-2M; left: easy, middle: middle, right: hard. Objectives of the Experiments Our experiments are tailored to answer the follow￾ing questions: (Q1) How does the LID of a query set influence the running time performance? (Section 5.1) (Q2) How diverse are measurements obtained on datasets? Do relative differences between the pe… view at source ↗
Figure 3
Figure 3. Figure 3: Recall-QPS (1/s) tradeoff – up and to the right is better; left: ONNG, right: Annoy; (E) — easy, (M) — medium, (H) — hard, (D) — diverse. 5.1 Influence of LID on Performance [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ranking of algorithms on five different datasets, according to recall ≥ 0.75 and ≥ 0.9, and according to two different performance measures: number of queries per second (left) and number of distance computations (right). Both plots report the ratio with the best performing algorithm on each dataset: for the queries per second metric a larger ratio is better, for the number of distance computations metric,… view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of performance for queries on the GLOVE-2M (medium difficulty) dataset. Looking just at the average performance can hide interesting behaviour [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of Recall vs. LID plot on the GLOVE datset. Intensity reflects number of queries that achieve a combination of recall vs. LID. For the bottom plots, we observe that individual query times of all the algorithms are well concentrated around their mean [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
read the original abstract

This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) allows to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that allow to get a more fine-grained overview of the inner workings of nearest neighbor search principles. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that local intrinsic dimensionality (LID) enables selection of query sets spanning a wide range of difficulties for nearest neighbor search on real-world datasets; that varying LID distributions affect the running-time performance of NN implementations (studied via new visualizations of algorithm internals); and that commonly used real-world datasets lack diversity because performance results on any single dataset predict results on the others well.

Significance. If the empirical claims hold after proper controls and documentation, the work could improve NN-search benchmarking by supplying a controllable proxy (LID) for query difficulty and by cautioning the community against treating a small collection of datasets as representative. The introduced visualizations might also aid diagnosis of algorithm behavior beyond aggregate recall or time metrics.

major comments (3)
  1. [Abstract / query-set selection section] Abstract and the section describing the LID-based query-set construction: the central claim that LID distributions permit selection of query sets with a wide difficulty range (and that these distributions affect running times) requires evidence that LID is the dominant driver. No controls or comparisons are described against other per-query factors such as the empirical distribution of distances to the k nearest neighbors, local density, or cluster structure; without such controls the proxy assumption remains untested.
  2. [Closing remarks on dataset diversity] The diversity claim (results on a single dataset predict results on all others): the manuscript must specify the exact datasets, the performance metrics used for the prediction test, the quantitative measure of predictive power (e.g., correlation or rank agreement), and any statistical significance assessment. The current statement is too coarse to evaluate whether the datasets truly share unmeasured properties that LID does not isolate.
  3. [Empirical study section] Empirical study of LID effects on running time: the description supplies no information on the number of runs, variance estimation, hardware controls, or error bars, making it impossible to judge whether observed differences are attributable to LID rather than implementation or measurement artifacts.
minor comments (2)
  1. [Introduction] Notation for LID and the precise definition of “difficulty” should be introduced earlier and used consistently.
  2. [Visualization section] Figure captions for the new visualization concepts should explicitly state what each panel shows and how it relates to the LID-controlled query sets.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify areas where additional rigor and detail will strengthen the manuscript. We respond to each major comment below and indicate the planned revisions.

read point-by-point responses
  1. Referee: [Abstract / query-set selection section] Abstract and the section describing the LID-based query-set construction: the central claim that LID distributions permit selection of query sets with a wide difficulty range (and that these distributions affect running times) requires evidence that LID is the dominant driver. No controls or comparisons are described against other per-query factors such as the empirical distribution of distances to the k nearest neighbors, local density, or cluster structure; without such controls the proxy assumption remains untested.

    Authors: We acknowledge that the manuscript does not include explicit controls comparing LID against other per-query factors. The central contribution is the introduction of LID as a practical, controllable proxy together with visualizations that illustrate its effect on search behavior. We will revise the relevant sections to discuss potential confounding factors (distance distributions, local density, cluster structure) and to state explicitly that the current evidence is correlational rather than a controlled isolation of LID. This will be a textual clarification and limitation statement; no new experiments are planned at this stage. revision: partial

  2. Referee: [Closing remarks on dataset diversity] The diversity claim (results on a single dataset predict results on all others): the manuscript must specify the exact datasets, the performance metrics used for the prediction test, the quantitative measure of predictive power (e.g., correlation or rank agreement), and any statistical significance assessment. The current statement is too coarse to evaluate whether the datasets truly share unmeasured properties that LID does not isolate.

    Authors: We agree the claim is stated too coarsely. The experiments use the standard ANN-Benchmarks collection (SIFT, GloVe, MNIST, NYTimes, and similar). Performance is measured by recall@k and query time; predictive power is assessed via rank correlation of algorithm orderings across datasets. We will add the exact dataset list, the precise metrics, the correlation values obtained, and a brief note on statistical significance. These additions will make the diversity observation precise and reproducible. revision: yes

  3. Referee: [Empirical study section] Empirical study of LID effects on running time: the description supplies no information on the number of runs, variance estimation, hardware controls, or error bars, making it impossible to judge whether observed differences are attributable to LID rather than implementation or measurement artifacts.

    Authors: We accept this criticism on reproducibility. The experimental section will be expanded to report the number of independent runs per configuration, the hardware platform (CPU model, memory), the method used for variance estimation, and error bars on the time and recall plots. These details were present in our internal logs but omitted from the submitted text. revision: yes

Circularity Check

0 steps flagged

No circularity; claims are empirical observations from data

full rationale

The paper's core results—that LID distributions can be used to select query sets spanning difficulty ranges and that results on one real-world dataset predict others—are presented as direct empirical findings from measurements and visualizations on multiple datasets. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the abstract or described claims; the derivation chain consists of data-driven comparisons rather than reductions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; no details on any fitted values or background assumptions beyond the use of LID as a difficulty measure.

pith-pipeline@v0.9.0 · 5636 in / 1025 out tokens · 25355 ms · 2026-05-24T20:16:24.811188+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 4 internal anchors

  1. [1]

    In: FOCS’15

    Alman, J., Williams, R.: Probabilistic polynomials and hamming nearest neighbors. In: FOCS’15. pp. 136–150

  2. [2]

    In: KDD’15

    Amsaleg, L., Chelly, O., Furon, T., Girard, S., Houle, M.E., Kawarabayashi, K.I., Nett, M.: Estimating local intrinsic dimensionality. In: KDD’15. pp. 29–38. ACM (2015)

  3. [3]

    In: Proceedings of the 2019 SIAM International Conference on Data Mining

    Amsaleg, L., Chelly, O., Houle, M.E., Kawarabayashi, K.i., Radovanovi´ c, M., Treeratanajaru, W.: Intrinsic dimensionality estimation within tight localities. In: Proceedings of the 2019 SIAM International Conference on Data Mining. pp. 181–189. SIAM (2019)

  4. [4]

    5 Mayank Bawa, Tyson Condie, and Prasanna Ganesan

    Aum¨ uller, M., Bernhardsson, E., Faithfull, A.: ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. In: SISAP’17. pp. 34–49 (2017). https://doi.org/10.1007/978-3-319-68474-1 3

  5. [5]

    In: 1st Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning (EDML 2019) (2019), https://imada.sdu.dk/Research/EDML/

    Aum¨ uller, M., Ceccarello, M.: Benchmarking nearest neighbor search: Influence of local intrinsic dimensionality and result diversity in real-world datasets. In: 1st Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning (EDML 2019) (2019), https://imada.sdu.dk/Research/EDML/

  6. [6]

    Bernhardsson, E.: Annoy, https://github.com/spotify/annoy

  7. [7]

    PVLDB 10(7), 769–780 (2017)

    Casanova, G., Englmeier, E., Houle, M.E., Kr¨ oger, P., Nett, M., Schubert, E., Zimek, A.: Dimensional testing for reverse k-nearest neighbor search. PVLDB 10(7), 769–780 (2017)

  8. [8]

    ACM Comput

    Ch´ avez, E., Navarro, G., Baeza-Yates, R., Marroqu´ ın, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (Sep 2001). https://doi.org/10.1145/502807.502808 14 Martin Aum¨ uller and Matteo Ceccarello

  9. [9]

    Curtin, R.R., Cline, J.R., Slagle, N.P., March, W.B., Ram, P., Mehta, N.A., Gray, A.G.: MLPACK: A scalable C++ machine learning library. J. of Machine Learning Research 14, 801–805 (2013)

  10. [10]

    In: NIPS 2014 Workshop on Software Engineering for Machine Learning (2014)

    Edel, M., Soni, A., Curtin, R.R.: An automatic benchmarking system. In: NIPS 2014 Workshop on Software Engineering for Machine Learning (2014)

  11. [11]

    In: Data Mining Workshops (ICDMW)

    Houle, M.E.: Dimensionality, discriminability, density and distance distributions. In: Data Mining Workshops (ICDMW). pp. 468–473. IEEE (2013)

  12. [12]

    In: SISAP’18

    Houle, M.E., Schubert, E., Zimek, A.: On the correlation between local intrinsic dimensionality and outlierness. In: SISAP’18. pp. 177–191 (2018). https://doi.org/10.1007/978-3-030-02224-2 14

  13. [13]

    ArXiv e-prints (Oct 2018)

    Iwasaki, M., Miyazaki, D.: Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data. ArXiv e-prints (Oct 2018)

  14. [14]

    IEEE Trans

    J´ egou, H., Douze, M., Schmid, C.: Product quantization for nearest neigh- bor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011). https://doi.org/10.1109/TPAMI.2010.57

  15. [15]

    Billion-scale similarity search with GPUs

    Johnson, J., Douze, M., J´ egou, H.: Billion-scale similarity search with gpus. CoRR abs/1702.08734 (2017)

  16. [16]

    Israel Journal of Mathematics 54(2), 129–138 (1986)

    Johnson, W.B., Lindenstrauss, J., Schechtman, G.: Extensions of Lipschitz maps into Banach spaces. Israel Journal of Mathematics 54(2), 129–138 (1986)

  17. [17]

    Springer (2011)

    Jolliffe, I.: Principal component analysis. Springer (2011)

  18. [18]

    Kriegel, H., Schubert, E., Zimek, A.: The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowl. Inf. Syst. 52(2), 341–378 (2017)

  19. [19]

    In: NIPS’15

    Levina, E., Bickel, P.J.: Maximum likelihood estimation of intrinsic dimension. In: NIPS’15. pp. 777–784 (2005)

  20. [20]

    Approximate Nearest Neighbor Search on High Dimensional Data --- Experiments, Analyses, and Improvement (v1.0)

    Li, W., Zhang, Y., Sun, Y., Wang, W., Zhang, W., Lin, X.: Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement (v1.0). CoRR abs/1610.02455 (2016)

  21. [21]

    ArXiv e-prints (Mar 2016)

    Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. ArXiv e-prints (Mar 2016)

  22. [22]

    Efficient Estimation of Word Representations in Vector Space

    Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre- sentations in vector space. CoRR abs/1301.3781 (2013)

  23. [23]

    In: Empirical Methods in Natural Language Processing (EMNLP)

    Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word repre- sentation. In: Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)

  24. [24]

    Computers & Operations Research 45, 12 – 24 (2014)

    Smith-Miles, K., Baatar, D., Wreford, B., Lewis, R.: Towards objective measures of algorithm performance across instance space. Computers & Operations Research 45, 12 – 24 (2014)

  25. [25]

    In: KDD’17

    Spring, R., Shrivastava, A.: Scalable and sustainable deep learning via randomized hashing. In: KDD’17. pp. 445–454 (2017). https://doi.org/10.1145/3097983.3098035

  26. [26]

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for bench- marking machine learning algorithms. CoRR abs/1708.07747 (2017)