pith. sign in

arxiv: 2601.01930 · v3 · submitted 2026-01-05 · 💻 cs.IR · cs.AI

MCGI: Manifold-Consistent Graph Indexing for Billion-Scale Disk-Resident Vector Search

Pith reviewed 2026-05-16 18:15 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords manifold-consistent graph indexinglocal intrinsic dimensionalityapproximate nearest neighbor searchdisk-resident vector searchbeam search adaptationbillion-scale indexinggraph-based ANN
0
0 comments X

The pith

MCGI adapts beam budgets using local intrinsic dimensionality to preserve manifold connectivity in graph-based vector search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Manifold-Consistent Graph Indexing to fix the mismatch between Euclidean distances and geodesic paths on data manifolds during approximate nearest neighbor search. Standard graph methods use fixed beam widths that cause routing errors in high dimensions because they ignore local geometry. MCGI computes in-situ local intrinsic dimensionality estimates to adjust the beam search range dynamically, replacing a single hyperparameter with a stable geometry-informed interval. This adjustment is claimed to maintain topological connectivity that aligns with the underlying manifold structure. The method targets disk-resident indexes at billion-scale without increasing sensitivity to dataset-specific tuning.

Core claim

MCGI provides robust approximation by preserving manifold-consistent topological connectivity through dynamic modulation of beam search budgets informed by in-situ local intrinsic dimensionality estimates, which reduces divergence from the data manifold compared to uniform treatment of dimensions.

What carries the argument

Local intrinsic dimensionality modulated beam search, which replaces a fixed scalar budget with a geometry-aware range computed during search to adapt to local manifold properties.

If this is right

  • Search performance becomes more stable across datasets with different dimensionalities and scales.
  • Manifold connectivity preservation leads to fewer routing failures in high-dimensional spaces.
  • Disk-resident graphs can support billion-scale queries with lower hyperparameter tuning effort.
  • The approach generalizes beyond specific datasets by using local geometry rather than global assumptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LID modulation principle could be tested in memory-resident graph indexes to isolate disk effects.
  • Combining MCGI with curvature-aware sampling might improve connectivity preservation on manifolds with varying local properties.
  • Direct measurement of geodesic versus Euclidean path lengths on synthetic manifolds would provide a targeted test of the connectivity claim.

Load-bearing premise

In-situ local intrinsic dimensionality estimates can be computed accurately enough during search to modulate beam budgets without introducing new errors or excessive overhead.

What would settle it

Measure whether approximation error rises when the LID-based modulation is disabled and replaced by fixed beam widths in controlled tests on datasets with known manifold structure.

Figures

Figures reproduced from arXiv: 2601.01930 by Dongfang Zhao.

Figure 1
Figure 1. Figure 1: Recall-QPS Trade-off. Comparison of MCGI against DiskANN and Faiss (mmap). [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Billion scale scalability and resource efficiency evaluations. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance on T2I-1B. during query execution to capture the Recall-QPS trade-off. Threads specifies the number of parallel threads utilized during search to maximize hardware concurrency [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗
read the original abstract

Graph-based Approximate Nearest Neighbor (ANN) search often suffers from performance degradation in high-dimensional spaces due to the Euclidean-Geodesic mismatch, where greedy routing diverges from the underlying data manifold. To address this challenge, this paper presents Manifold-Consistent Graph Indexing (MCGI), a geometry-aware and disk-resident indexing method that leverages Local Intrinsic Dimensionality (LID) to dynamically adapt search strategies to the intrinsic geometry of data. Unlike conventional algorithms that treat dimensions uniformly, MCGI modulates its beam search budget based on in-situ geometric analysis, which reduces sensitivity to data-specific hyperparameters by replacing a single scalar with a geometry-informed range that remains stable across datasets of varying dimensionality. Theoretical analysis demonstrates that MCGI provides robust approximation by preserving manifold-consistent topological connectivity. Extensive evaluations against five industry-standard baselines across five datasets up to billion scales confirm the advantages of the proposed approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Manifold-Consistent Graph Indexing (MCGI), a disk-resident graph index for billion-scale vector search that uses in-situ Local Intrinsic Dimensionality (LID) estimates to modulate per-query beam budgets. This is intended to mitigate the Euclidean-geodesic mismatch by preserving manifold-consistent topological connectivity during greedy routing, replacing a single scalar hyperparameter with a geometry-informed range claimed to be more stable across datasets. Theoretical analysis is asserted to demonstrate robust approximation guarantees, and empirical results are reported against five baselines on five datasets reaching billion scale.

Significance. If the unstated error bounds on in-situ LID estimation hold and the modulation preserves connectivity without prohibitive overhead, the approach could reduce hyperparameter sensitivity in high-dimensional graph ANN and improve robustness for disk-resident indexes. The scale of the claimed evaluations is a positive factor, but significance is limited by the absence of explicit derivations or overhead analysis for the LID component.

major comments (3)
  1. [Theoretical Analysis] Theoretical Analysis section: the assertion that MCGI 'provides robust approximation by preserving manifold-consistent topological connectivity' is load-bearing for the central claim, yet no derivation, error bound, or distortion analysis is supplied for the in-situ LID estimates used to set beam budgets. Without these, the guarantee reduces to an unverified assumption that LID computation introduces negligible error relative to manifold curvature.
  2. [Method] Method description (beam-modulation paragraph): the replacement of a single scalar hyperparameter by a 'geometry-informed range' is presented as reducing sensitivity, but the paper does not demonstrate that determining or applying this range is itself parameter-free or that its stability across dimensionalities has been proven rather than observed post-hoc.
  3. [Experimental Evaluation] Experimental Evaluation section: claims of advantage on billion-scale datasets rest on comparisons whose statistical reliability cannot be assessed because error bars, number of runs, and exact tuning protocols for both MCGI and the five baselines are not reported.
minor comments (2)
  1. [Abstract] Abstract and introduction: the phrase 'replaces a single scalar with a geometry-informed range' is repeated without a concrete definition or pseudocode showing how the range is computed from LID values.
  2. [Figures] Figure captions and tables: axis labels and legend entries for the billion-scale timing and recall plots should explicitly state the beam-budget modulation formula used in each curve.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the theoretical foundations, clarify the method, and improve the experimental reporting.

read point-by-point responses
  1. Referee: [Theoretical Analysis] Theoretical Analysis section: the assertion that MCGI 'provides robust approximation by preserving manifold-consistent topological connectivity' is load-bearing for the central claim, yet no derivation, error bound, or distortion analysis is supplied for the in-situ LID estimates used to set beam budgets. Without these, the guarantee reduces to an unverified assumption that LID computation introduces negligible error relative to manifold curvature.

    Authors: We agree that the current presentation of the theoretical analysis is high-level and would benefit from explicit derivations. In the revised manuscript, we will expand the Theoretical Analysis section to include: (i) a derivation of the approximation guarantee based on the preservation of manifold-consistent edges under LID-modulated beam search, (ii) error bounds on in-situ LID estimates derived from concentration results for local dimensionality estimators, and (iii) a distortion analysis quantifying the maximum deviation from geodesic distances introduced by the modulation. These additions will replace the implicit assumption of negligible error with quantitative bounds. revision: yes

  2. Referee: [Method] Method description (beam-modulation paragraph): the replacement of a single scalar hyperparameter by a 'geometry-informed range' is presented as reducing sensitivity, but the paper does not demonstrate that determining or applying this range is itself parameter-free or that its stability across dimensionalities has been proven rather than observed post-hoc.

    Authors: The geometry-informed range is obtained directly from the per-point LID values computed during index construction and query processing, without introducing additional tunable scalars beyond the base graph parameters. While the cross-dataset stability was demonstrated empirically, we acknowledge that a formal proof of invariance across dimensionalities would be stronger. In the revision we will add a dedicated paragraph clarifying the parameter-free computation of the range and include a brief theoretical argument (leveraging the scale-invariance properties of LID) together with expanded empirical validation across a wider range of dimensionalities. revision: partial

  3. Referee: [Experimental Evaluation] Experimental Evaluation section: claims of advantage on billion-scale datasets rest on comparisons whose statistical reliability cannot be assessed because error bars, number of runs, and exact tuning protocols for both MCGI and the five baselines are not reported.

    Authors: We fully agree that statistical details are required for assessing reliability. In the revised manuscript we will report: (i) results averaged over 10 independent runs with error bars showing standard deviation, (ii) the exact number of runs performed, and (iii) the complete tuning protocols, including the hyperparameter search grids and selection criteria used for MCGI and all five baselines. These additions will be placed in the Experimental Evaluation section and the supplementary material. revision: yes

Circularity Check

0 steps flagged

No circularity: theoretical claim rests on proposed LID modulation without reducing to self-defined inputs

full rationale

The paper's strongest claim is that 'Theoretical analysis demonstrates that MCGI provides robust approximation by preserving manifold-consistent topological connectivity.' No equations, derivations, or self-citations are exhibited in the abstract or described text that reduce this preservation to a fitted parameter, renamed input, or self-citation chain. The approach modulates beam budgets via in-situ LID estimates as a new geometry-aware strategy, supported by external baselines rather than internal redefinition. No self-definitional, fitted-prediction, or uniqueness-imported patterns appear. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are introduced beyond the standard concept of Local Intrinsic Dimensionality from prior literature.

pith-pipeline@v0.9.0 · 5440 in / 971 out tokens · 36937 ms · 2026-05-16T18:15:19.419814+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

  1. [1]

    Houle, Ken-ichi Kawarabayashi, and Michael Nett

    Laurent Amsaleg, Oussama Chelly, Teddy Furon, Stéphane Girard, Michael E. Houle, Ken-ichi Kawarabayashi, and Michael Nett. Estimating local intrinsic dimensionality. InProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, page 29–38, New York, NY , USA, 2015. Association for Computing Machinery

  2. [2]

    Princeton University Press, 2010

    RICHARD BELLMAN and Stuart Dreyfus.Dynamic Programming, volume 33. Princeton University Press, 2010

  3. [3]

    Multidimensional binary search trees used for associative searching.Com- mun

    Jon Louis Bentley. Multidimensional binary search trees used for associative searching.Com- mun. ACM, 18(9):509–517, September 1975

  4. [4]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  5. [5]

    Spann: highly-efficient billion-scale approximate nearest neighbor search

    Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. Spann: highly-efficient billion-scale approximate nearest neighbor search. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY , USA, 2021. Curran Associates Inc

  6. [6]

    Random projection trees and low dimensional manifolds

    Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. InProceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC ’08, page 537–546, New York, NY , USA, 2008. Association for Computing Machinery

  7. [7]

    Mirrokni

    Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. InProceedings of the Twentieth Annual Symposium on Computational Geometry, SCG ’04, page 253–262, New York, NY , USA, 2004. Association for Computing Machinery

  8. [8]

    Navigable graphs for high-dimensional nearest neighbor search: Constructions and limits

    Haya Diwan, Jinrui Gou, Cameron Musco, Christopher Musco, and Torsten Suel. Navigable graphs for high-dimensional nearest neighbor search: Constructions and limits. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 59513–59531. Curran Associates...

  9. [9]

    Fast approximate nearest neighbor search with the navigating spreading-out graph.Proc

    Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. Fast approximate nearest neighbor search with the navigating spreading-out graph.Proc. VLDB Endow., 12(5):461–474, January 2019

  10. [10]

    Optimized product quantization

    Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization. volume 36, page 744–755, USA, April 2014. IEEE Computer Society

  11. [11]

    Gupta, R

    A. Gupta, R. Krauthgamer, and J.R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 534–543, 2003

  12. [12]

    R-trees: a dynamic index structure for spatial searching

    Antonin Guttman. R-trees: a dynamic index structure for spatial searching. InProceedings of the 1984 ACM SIGMOD International Conference on Management of Data, SIGMOD ’84, page 47–57, New York, NY , USA, 1984. Association for Computing Machinery

  13. [13]

    On the difficulty of nearest neighbor search

    Junfeng He, Sanjiv Kumar, and Shih-Fu Chang. On the difficulty of nearest neighbor search. InProceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, page 41–48, Madison, WI, USA, 2012. Omnipress

  14. [14]

    Michael E. Houle. Local intrinsic dimensionality I: an extreme-value-theoretic foundation for similarity applications. In Christian Beecks, Felix Borutta, Peer Kröger, and Thomas Seidl, editors,Similarity Search and Applications - 10th International Conference, SISAP 2017, Munich, Germany, October 4-6, 2017, Proceedings, volume 10609 ofLecture Notes in Co...

  15. [15]

    Approximate nearest neighbors: towards removing the curse of dimensionality

    Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. InProceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, page 604–613, New York, NY , USA, 1998. Association for Computing Machinery

  16. [16]

    Worst-case performance of popular approximate nearest neighbor search implementations: Guarantees and limitations

    Piotr Indyk and Haike Xu. Worst-case performance of popular approximate nearest neighbor search implementations: Guarantees and limitations. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 66239–66256. Curran Associates, Inc., 2023

  17. [17]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12), March 2023

  18. [18]

    Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2021

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2021

  19. [19]

    Product quantization for nearest neighbor search.IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011

    Herve Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011

  20. [20]

    Searching in one billion vectors: Re-rank with source coding

    Hervé Jégou, Romain Tavenard, Matthijs Douze, and Laurent Amsaleg. Searching in one billion vectors: Re-rank with source coding. In2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 861–864, 2011

  21. [21]

    Karger and Matthias Ruhl

    David R. Karger and Matthias Ruhl. Finding nearest neighbors in growth-restricted metrics. In Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing, STOC ’02, page 741–750, New York, NY , USA, 2002. Association for Computing Machinery

  22. [22]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online,...

  23. [23]

    Gunawi, Cody Hammock, Joe Mambretti, Alexander Barnes, François Halbach, Alex Rocha, and Joe Stubbs

    Kate Keahey, Jason Anderson, Zhuo Zhen, Pierre Riteau, Paul Ruth, Dan Stanzione, Mert Cevik, Jacob Colleran, Haryadi S. Gunawi, Cody Hammock, Joe Mambretti, Alexander Barnes, François Halbach, Alex Rocha, and Joe Stubbs. Lessons learned from the chameleon testbed. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC ’20). USENIX Assoc...

  24. [24]

    Maximum likelihood estimation of intrinsic dimension

    Elizaveta Levina and Peter Bickel. Maximum likelihood estimation of intrinsic dimension. In L. Saul, Y . Weiss, and L. Bottou, editors,Advances in Neural Information Processing Systems, volume 17. MIT Press, 2004

  25. [25]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. InProceed- ings of the 34th International Conference on Neural Information Processing Systems, NIPS ’...

  26. [26]

    Erfani, Sudanthi N

    Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi N. R. Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. InInternational Conference on Learning Representations, 2018

  27. [27]

    Malkov and D

    Yu A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. volume 42, page 824–836, USA, April 2020. IEEE Computer Society

  28. [28]

    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group o...

  29. [29]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machin...

  30. [30]

    The probabilistic relevance framework: Bm25 and beyond

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond. InFoundations and Trends in Information Retrieval, volume 3, pages 333–389. Now Publishers, Inc., 2009

  31. [31]

    Roweis and Lawrence K

    Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding.Science, 290(5500):2323–2326, 2000

  32. [32]

    Results of the big ANN: NeurIPS’23 competition

    Harsha Vardhan simhadri, Martin Aumüller, Matthijs Douze, Dmitry Baranchuk, Amir Ingber, Edo Liberty, George Williams, Ben Landrum, Magdalen Dobson Manohar, Mazin Karjikar, Laxman Dhulipala, Meng Chen, Yue Chen, Rui Ma, Kai Zhang, Yuzheng Cai, Jiayang Shi, Weiguo Zheng, Yizhuo Chen, Jie Yin, and Ben Huang. Results of the big ANN: NeurIPS’23 competition. I...

  33. [33]

    Results of the neurips’21 challenge on billion- scale approximate nearest neighbor search

    Harsha Vardhan Simhadri, George Williams, Martin Aumüller, Matthijs Douze, Artem Babenko, Dmitry Baranchuk, Qi Chen, Lucas Hosseini, Ravishankar Krishnaswamy, Gopal Srinivasa, Suhas Jayaram Subramanya, and Jingdong Wang. Results of the neurips’21 challenge on billion- scale approximate nearest neighbor search. InNeurIPS 2021 Competitions and Demonstration...

  34. [34]

    Diskann: fast accurate billion-point nearest neighbor search on a single node

    Suhas Jayaram Subramanya, Devvrit, Rohan Kadekodi, Ravishankar Krishaswamy, and Har- sha Vardhan Simhadri. Diskann: fast accurate billion-point nearest neighbor search on a single node. InProceedings of the 33rd International Conference on Neural Information Processing Systems, 2019

  35. [35]

    Tenenbaum, Vin de Silva, and John C

    Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction.Science, 290(5500):2319–2323, 2000

  36. [36]

    Toussaint

    Godfried T. Toussaint. The relative neighbourhood graph of a finite planar set.Pattern Recognition, 12(4):261–268, 1980

  37. [37]

    Llama: Open and efficient foundation language models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. 2023

  38. [38]

    Steiner-hardness: A query hardness measure for graph-based ann indexes.Proc

    Zeyu Wang, Qitong Wang, Xiaoxing Cheng, Peng Wang, Themis Palpanas, and Wei Wang. Steiner-hardness: A query hardness measure for graph-based ann indexes.Proc. VLDB Endow., 17(13):4668–4682, September 2024

  39. [39]

    Cspg: Crossing sparse proximity graphs for approximate nearest neighbor search

    Ming Yang, Yuzheng Cai, and Weiguo Zheng. Cspg: Crossing sparse proximity graphs for approximate nearest neighbor search. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 103076–103100. Curran Associates, Inc., 2024

  40. [40]

    Incremental isometric embedding of high-dimensional data using connected neighborhood graphs.IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):86–98, 2009

    Dongfang Zhao and Li Yang. Incremental isometric embedding of high-dimensional data using connected neighborhood graphs.IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):86–98, 2009

  41. [41]

    Towards efficient index construction and approximate nearest neighbor search in high-dimensional spaces.Proc

    Xi Zhao, Yao Tian, Kai Huang, Bolong Zheng, and Xiaofang Zhou. Towards efficient index construction and approximate nearest neighbor search in high-dimensional spaces.Proc. VLDB Endow., 16(8):1979–1991, April 2023. 12 A Complete Proofs Proposition A.1(Monotonicity).The mapping function Φ is strictly decreasing with respect to the estimated local intrinsic...

  42. [42]

    Let s=∥u−v∥ be the distance between the current node and the candidate neighbor (the step size)

    Geometric Condition for Improvement.To make progress, the algorithm must identify a neighbor v such that the distance to the query is reduced by a factor ϵ (where 0< ϵ <1 ), i.e., ∥v−q∥ ≤(1−ϵ)r . Let s=∥u−v∥ be the distance between the current node and the candidate neighbor (the step size). Applying the Law of Cosines to triangle△uvq, we have: ∥v−q∥ 2 =r...

  43. [43]

    The probability Psuccess of finding a neighbor in the improving cone corresponds to the ratio of the surface area of the spherical cap C(θmax) to the total surface area of Sd−1

    Volume Concentration of the Improving Region.In the tangent space at u, the directions of neighbors are uniformly distributed on the unit hypersphere Sd−1. The probability Psuccess of finding a neighbor in the improving cone corresponds to the ratio of the surface area of the spherical cap C(θmax) to the total surface area of Sd−1. This ratio is given by ...

  44. [44]

    The denominator relates to the surface area of the hypersphere and can be evaluated using the standard integral identity for powers of sine

    Asymptotic Analysis.We analyze the asymptotic behavior of this ratio as d→ ∞ . The denominator relates to the surface area of the hypersphere and can be evaluated using the standard integral identity for powers of sine. Exploiting the symmetry ofsinϕaroundπ/2, we have: Z π 0 sind−2 ϕ dϕ= 2 Z π 2 0 sind−2 ϕ dϕ = 2· √π 2 Γ( d−1 2 ) Γ( d

  45. [45]

    = √π Γ( d−1 2 ) Γ( d

  46. [46]

    Settingx=d/2anda=−1/2, we obtain the approximation: √π Γ( d 2 − 1 2) Γ( d

    (40) For high-dimensional spaces (d≫1 ), we apply the asymptotic property of the Gamma function ratio, Γ(x+a) Γ(x) ≈x a. Settingx=d/2anda=−1/2, we obtain the approximation: √π Γ( d 2 − 1 2) Γ( d

  47. [47]

    Since the integrand sind−2 ϕ decays exponentially fast away from θmax, the integral is dominated by the contribution near the upper limit

    ≈ √π d 2 − 1 2 = r 2π d .(41) For the numerator, we exploit the concentration of measure near the integration boundary θmax. Since the integrand sind−2 ϕ decays exponentially fast away from θmax, the integral is dominated by the contribution near the upper limit. We apply integration by parts to extract the leading order term. Recall that d dϕ(sind−1 ϕ) =...

  48. [48]

    Thus: E[Ndist]≥Ω √ d·exp(λ·d) .(47) This confirms that the routing cost is dominated by the exponential termexp(λ·LID(q)) , necessitating the adaptive beam width design in MCGI

    Complexity Bound.The expected number of distance evaluations follows E[Ndist] = 1/Psuccess. Thus: E[Ndist]≥Ω √ d·exp(λ·d) .(47) This confirms that the routing cost is dominated by the exponential termexp(λ·LID(q)) , necessitating the adaptive beam width design in MCGI. B Experimental Details Datasets.We evaluate our method on four standard benchmarks, ran...