LSM-VEC: A Large-Scale Disk-Based System for Dynamic Vector Search
Pith reviewed 2026-05-22 02:29 UTC · model grok-4.3
The pith
LSM-VEC layers hierarchical graphs over LSM-tree storage to support out-of-place updates for billion-scale vector search.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LSM-VEC integrates hierarchical graph indexing with LSM-tree storage by distributing the proximity graph across multiple levels, which permits out-of-place vector updates. Search proceeds via a sampling-based probabilistic strategy with adaptive neighbor selection, while connectivity-aware graph reordering reduces I/O without global reconstruction. On billion-scale datasets the design delivers higher recall, lower query and update latency, and more than 66.2 percent memory savings compared with prior disk-based ANN systems.
What carries the argument
Hierarchical proximity graph distributed across LSM-tree levels, which carries out-of-place updates and supports probabilistic sampling during search.
If this is right
- High-throughput dynamic workloads with frequent inserts and deletes become practical on disk without full index rebuilds.
- Memory consumption drops enough to fit billion-scale indexes on hardware that previously required much larger RAM.
- Query latency and recall improve simultaneously rather than trading one for the other.
- No offline batch construction phase is required when the underlying vector collection changes continuously.
Where Pith is reading between the lines
- The same layering idea could be applied to other graph indexes that currently demand static construction.
- Continuous index maintenance might let retrieval-augmented generation pipelines update their knowledge bases in real time instead of periodic refreshes.
- Streaming vector arrivals could be tested to see whether the LSM-tree levels absorb new data without periodic compaction pauses.
Load-bearing premise
The sampling-based probabilistic search with adaptive neighbors keeps recall close to full deterministic traversal, and the LSM-tree layout prevents excessive I/O amplification under high-rate mixed updates.
What would settle it
A head-to-head run on a billion-vector dataset that shows either recall falling below a competing disk system at matched latency or update throughput dropping sharply under sustained insert-delete traffic.
Figures
read the original abstract
Vector search underpins modern AI applications by supporting approximate nearest neighbor (ANN) queries over high-dimensional embeddings in tasks like retrieval-augmented generation (RAG), recommendation systems, and multimodal search. Traditional ANN search indices (e.g., HNSW) are limited by memory constraints at large data scale. Disk-based indices such as DiskANN reduce memory overhead but rely on offline graph construction, resulting in costly and inefficient vector updates. The state-of-the-art clustering-based approach SPFresh offers better scalability but suffers from reduced recall due to coarse partitioning. Moreover, SPFresh employs in-place updates to maintain its index structure, limiting its efficiency in handling high-throughput insertions and deletions under dynamic workloads. This paper presents LSM-VEC, a disk-based dynamic vector index that integrates hierarchical graph indexing with LSM-tree storage. By distributing the proximity graph across multiple LSM-tree levels, LSM-VEC supports out-of-place vector updates. It enhances search efficiency via a sampling-based probabilistic search strategy with adaptive neighbor selection, and connectivity-aware graph reordering further reduces I/O without requiring global reconstruction. Experiments on billion-scale datasets demonstrate that LSM-VEC consistently outperforms existing disk-based ANN systems. It achieves higher recall, lower query and update latency, and reduces memory footprint by over 66.2%, making it well-suited for real-world large-scale vector search with dynamic updates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents LSM-VEC, a disk-based dynamic vector index that integrates hierarchical graph indexing with LSM-tree storage to enable out-of-place updates for high-throughput insertions and deletions. It introduces a sampling-based probabilistic search strategy with adaptive neighbor selection and connectivity-aware graph reordering to reduce I/O without global reconstruction. Experiments on billion-scale datasets are claimed to show consistent outperformance over existing disk-based ANN systems (e.g., DiskANN, SPFresh) with higher recall, lower query/update latency, and over 66.2% memory footprint reduction.
Significance. If the central experimental claims hold under realistic dynamic workloads, the work would be significant for large-scale vector search in AI applications, as it directly addresses the update inefficiency of offline graph construction in DiskANN and the recall limitations of clustering-based approaches like SPFresh. The LSM-tree integration for dynamic graph maintenance is a novel angle in the disk-based ANN literature.
major comments (2)
- [Abstract and Experimental Evaluation] Abstract and § on experimental evaluation: the headline claims of higher recall, lower latency, and 66.2% memory reduction on billion-scale data rest on unexamined experimental design. The manuscript provides no details on exact baselines, workload mixes (update/query ratio), error bars, exclusion criteria, or whether recall was measured after LSM merges rather than on static snapshots. This directly impacts the load-bearing assumption that the sampling strategy and LSM distribution preserve recall under sustained updates.
- [Method (search strategy and LSM integration)] Method section on search strategy and LSM distribution: the claim that the sampling-based probabilistic search with adaptive neighbor selection preserves recall close to deterministic traversal is not shown to hold once the proximity graph is distributed across LSM levels and merges occur. No ablation or measurement under active merge traffic is described, leaving the I/O amplification bound unverified for the reported gains.
minor comments (2)
- [Abstract] Abstract: the memory reduction figure of 'over 66.2%' should state the precise baseline memory usage and the exact configuration (e.g., which system and index size) for reproducibility.
- [Introduction/Method] Notation: the distinction between 'hierarchical graph indexing' and standard HNSW-style graphs should be clarified early, as the LSM distribution is the key differentiator.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of experimental rigor and methodological validation that we address below. We commit to revisions that strengthen the presentation without altering the core contributions.
read point-by-point responses
-
Referee: [Abstract and Experimental Evaluation] Abstract and § on experimental evaluation: the headline claims of higher recall, lower latency, and 66.2% memory reduction on billion-scale data rest on unexamined experimental design. The manuscript provides no details on exact baselines, workload mixes (update/query ratio), error bars, exclusion criteria, or whether recall was measured after LSM merges rather than on static snapshots. This directly impacts the load-bearing assumption that the sampling strategy and LSM distribution preserve recall under sustained updates.
Authors: We agree that the experimental section would benefit from greater explicitness. In the revised manuscript we will add a dedicated subsection detailing: the exact baseline implementations and parameter settings for DiskANN and SPFresh; the precise update-to-query ratios employed in the dynamic workloads; error bars computed over five independent runs with reported standard deviations; any exclusion criteria applied to queries; and explicit confirmation that all reported recall figures were obtained after LSM merges and compactions on the live index state. These additions will directly substantiate the claims under sustained dynamic conditions. revision: yes
-
Referee: [Method (search strategy and LSM integration)] Method section on search strategy and LSM distribution: the claim that the sampling-based probabilistic search with adaptive neighbor selection preserves recall close to deterministic traversal is not shown to hold once the proximity graph is distributed across LSM levels and merges occur. No ablation or measurement under active merge traffic is described, leaving the I/O amplification bound unverified for the reported gains.
Authors: The sampling-based search and its recall behavior are evaluated within the complete LSM-VEC system that includes concurrent merges, as reflected in the billion-scale dynamic experiments. Nevertheless, we recognize that an isolated ablation under active merge traffic would provide clearer verification of the I/O bounds. We will therefore include a new ablation in the revision that measures recall and I/O amplification specifically during periods of high merge activity, comparing the probabilistic sampler against deterministic traversal where possible. revision: partial
Circularity Check
No circularity: empirical claims rest on external benchmarks, not self-referential derivations
full rationale
The paper describes LSM-VEC as an engineering system that combines hierarchical graphs with LSM-tree storage for dynamic updates, using sampling-based probabilistic search and connectivity-aware reordering. All performance claims (higher recall, lower latency, 66.2% memory reduction on billion-scale data) are presented as outcomes of experiments against DiskANN and SPFresh rather than quantities derived from fitted parameters or prior self-citations within the same work. No equations, uniqueness theorems, or ansatzes are introduced that reduce by construction to the paper's own inputs; the mechanisms are described as design choices whose validity is checked via independent workload measurements. This is a standard self-contained systems paper whose central results do not collapse into tautology.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
integrates hierarchical graph indexing with LSM-tree storage... sampling-based probabilistic search strategy with adaptive neighbor selection... connectivity-aware graph reordering
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Experiments on billion-scale datasets demonstrate that LSM-VEC consistently outperforms existing disk-based ANN systems
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Opal: Private Memory for Personal AI
Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.
Reference graph
Works this paper leans on
-
[1]
Amazon. 2025. Opensearch.https://aws.amazon.com/cn/opensearch- service/serverless-vector-database/
work page 2025
-
[2]
Kazuo Aoyama, Kazumi Saito, Hiroshi Sawada, and Naonori Ueda
-
[3]
Fast approximate similarity search based on degree-reduced neighborhood graphs. InProceedings of the 17th ACM SIGKDD interna- tional conference on Knowledge discovery and data mining. 1055–1063
-
[4]
Akhil Arora, Sakshi Sinha, Piyush Kumar, and Arnab Bhattacharya
-
[5]
Hd-index: Pushing the scalability-accuracy boundary for ap- proximate knn search in high-dimensional spaces.arXiv preprint arXiv:1804.06829(2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[6]
Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. 2020. ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms.Information Systems87 (2020), 101374
work page 2020
-
[7]
Erik Bernhardsson. 2017. Annoy: approximate nearest neighbors in C++/Python optimized for memory usage and loading/saving to disk. GitHub https://github. com/spotify/annoy(2017), 0–5
work page 2017
-
[8]
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean- Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. InInternational conference on machine learning. PMLR, 2206–2240
work page 2022
-
[9]
Yuan Cao, Heng Qi, Wenrui Zhou, Jien Kato, Keqiu Li, Xiulong Liu, and Jie Gui. 2017. Binary hashing for approximate nearest neighbor search on big data: A survey.IEEE Access6 (2017), 2039–2054
work page 2017
-
[10]
Moses S Charikar. 2002. Similarity estimation techniques from round- ing algorithms. InProceedings of the thiry-fourth annual ACM sympo- sium on Theory of computing. 380–388
work page 2002
-
[11]
Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. 2021. Spann: Highly- efficient billion-scale approximate nearest neighborhood search.Ad- vances in Neural Information Processing Systems34 (2021), 5199–5212
work page 2021
-
[12]
SpaceV Contributors. 2021. SPACEV1B: A billion-Scale vector dataset for text descriptors.Webpage. Retrieved March16 (2021), 2023
work page 2021
-
[13]
Scott Cost and Steven Salzberg. 1993. A weighted nearest neighbor algorithm for learning with symbolic features.Machine learning10 (1993), 57–78
work page 1993
-
[14]
Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. InProceedings of the 10th ACM conference on recommender systems. 191–198
work page 2016
-
[15]
Facebook. 2020. Faiss.https://github.com/facebookresearch/faiss. Accessed: 2025-03-28
work page 2020
- [16]
-
[17]
Long Gong, Huayi Wang, Mitsunori Ogihara, and Jun Xu. 2020. iDEC: indexable distance estimating codes for approximate nearest neighbor search.Proceedings of the VLDB Endowment13, 9 (2020)
work page 2020
-
[18]
Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. InInternational Conference on Machine Learning. PMLR, 3887–3896
work page 2020
-
[19]
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International conference on machine learning. PMLR, 3929–3938
work page 2020
-
[20]
Shayan Hassantabar, Zeyu Wang, and Niraj K Jha. 2021. SCANN: Synthesis of compact and accurate neural networks.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems41, 9 (2021), 3012–3025
work page 2021
-
[21]
Qiang Huang, Jianlin Feng, Yikai Zhang, Qiong Fang, and Wilfred Ng
-
[22]
Query-aware locality-sensitive hashing for approximate nearest neighbor search.Proceedings of the VLDB Endowment9, 1 (2015), 1–12
work page 2015
-
[23]
Masajiro Iwasaki. 2015. Ngt: Neighborhood graph and tree for index- ing
work page 2015
-
[24]
Masajiro Iwasaki and Daisuke Miyazaki. 2018. Optimization of in- dexing based on k-nearest neighbor graph for proximity search in high-dimensional data.arXiv preprint arXiv:1810.07355(2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[25]
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot learning with re- trieval augmented language models.Journal of Machine Learning Research24, 251 (2023), 1–43
work page 2023
-
[26]
Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnawamy, and Rohan Kadekodi. 2019. Diskann: Fast accurate billion-point nearest neighbor search on a single node.Ad- vances in neural information processing Systems32 (2019)
work page 2019
-
[27]
Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search.IEEE transactions on pattern analysis and machine intelligence33, 1 (2010), 117–128
work page 2010
- [28]
-
[29]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al . 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems33 (2020), 9459–9474
work page 2020
-
[30]
Wuchao Li, Chao Feng, Defu Lian, Yuxin Xie, Haifeng Liu, Yong Ge, and Enhong Chen. 2023. Learning balanced tree indexes for large-scale vector retrieval. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1353–1362
work page 2023
- [31]
-
[32]
Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. 2014. Approximate nearest neighbor algorithm based on navigable small world graphs.Information Systems45 (2014), 61–68
work page 2014
-
[33]
Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.IEEE transactions on pattern analysis and machine intelligence42, 4 (2018), 824–836
work page 2018
-
[34]
Magdalen Dobson Manohar, Zheqi Shen, Guy Blelloch, Laxman Dhuli- pala, Yan Gu, Harsha Vardhan Simhadri, and Yihan Sun. 2024. Par- layann: Scalable and deterministic parallel graph-based approximate nearest neighbor search algorithms. InProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Pro- gramming. 270–285
work page 2024
-
[35]
Yitong Meng, Xinyan Dai, Xiao Yan, James Cheng, Weiwen Liu, Jun Guo, Benben Liao, and Guangyong Chen. 2020. Pmd: An optimal transportation-based user distance for recommender systems. InAd- vances in Information Retrieval: 42nd European Conference on IR Re- search, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part II 42. Springer, 272–280
work page 2020
-
[36]
Dingheng Mo, Junfeng Liu, Fan Wang, and Siqiang Luo. 2025. Aster: Enhancing LSM-structures for Scalable Graph Database.Proceedings of the ACM on Management of Data3, 1 (2025), 1–26
work page 2025
-
[37]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763
work page 2021
-
[38]
Jie Ren, Minjia Zhang, and Dong Li. 2020. Hm-ann: Efficient billion- point nearest neighbor search on heterogeneous memory.Advances 13 SOSP ’25, October 13–16, 2025, Seoul, Republic of Korea Shurui Zhong, Dingheng Mo, and Siqiang Luo in Neural Information Processing Systems33 (2020), 10672–10684
work page 2020
-
[39]
Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. InPro- ceedings of the 10th international conference on World Wide Web. 285– 295
work page 2001
-
[40]
Chanop Silpa-Anan and Richard Hartley. 2008. Optimised KD-trees for fast image descriptor matching. In2008 IEEE conference on computer vision and pattern recognition. IEEE, 1–8
work page 2008
- [41]
-
[42]
Bing Tian, Haikun Liu, Yuhang Tang, Shihai Xiao, Zhuohui Duan, Xiaofei Liao, Hai Jin, Xuecang Zhang, Junhua Zhu, and Yu Zhang
-
[43]
In23rd USENIX Conference on File and Storage Technologies (FAST 25)
Towards High-throughput and Low-latency Billion-scale Vector Search via CPU/GPU Collaborative Filtering and Re-ranking. In23rd USENIX Conference on File and Storage Technologies (FAST 25). 171–185
-
[44]
Noam Touitou and Nissim Halabi. 2023. Approximate Nearest Neigh- bor Search through Modern Error-Correcting Codes. InThe Eleventh International Conference on Learning Representations
work page 2023
-
[45]
Hao Wei, Jeffrey Xu Yu, Can Lu, and Xuemin Lin. 2016. Speedup graph processing by graph ordering. InProceedings of the 2016 International Conference on Management of Data. 1813–1828
work page 2016
-
[46]
Yuming Xu, Hengyu Liang, Jin Li, Shuotao Xu, Qi Chen, Qianxi Zhang, Cheng Li, Ziyue Yang, Fan Yang, Yuqing Yang, et al. 2023. Spfresh: Incremental in-place update for billion-scale vector search. InProceed- ings of the 29th Symposium on Operating Systems Principles. 545–561
work page 2023
-
[47]
Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. 2015. NV-Tree: reducing consistency cost for NVM-based single level systems. In13th USENIX Conference on File and Storage Technologies (FAST 15). 167–181
work page 2015
-
[48]
Shulin Zeng, Zhenhua Zhu, Jun Liu, Haoyu Zhang, Guohao Dai, Zixuan Zhou, Shuangchen Li, Xuefei Ning, Yuan Xie, Huazhong Yang, et al
-
[49]
In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
DF-GAS: a distributed FPGA-as-a-service architecture towards billion-scale graph-based approximate nearest neighbor search. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 283–296
-
[50]
Minjia Zhang and Yuxiong He. 2018. Zoom: Ssd-based vector search for optimizing accuracy, latency and memory.arXiv preprint arXiv:1809.04067(2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[51]
Minjia Zhang, Jie Ren, Zhen Peng, Ruoming Jin, Dong Li, and Bin Ren
-
[52]
iqan: Fast and accurate vector search with efficient intra-query parallelism on multi-core architectures.IEEE Data Eng. Bull.(2023)
work page 2023
-
[53]
Zili Zhang, Fangyue Liu, Gang Huang, Xuanzhe Liu, and Xin Jin. 2024. Fast Vector Query Processing for Large Datasets Beyond GPU Memory with Reordered Pipelining. In21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 23–40
work page 2024
-
[54]
Wenhui Zhou, Chunfeng Yuan, Rong Gu, and Yihua Huang. 2013. Large scale nearest neighbors search based on neighborhood graph. In 2013 International Conference on Advanced Cloud and Big Data. IEEE, 181–186. 14
work page 2013
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.