pith. sign in

arxiv: 2605.25522 · v1 · pith:GVYTPYMInew · submitted 2026-05-25 · 💻 cs.AR

Co-Designing Graph-based Approximate Nearest Neighbor Search at Billion Scale for Processing-in-Memory

Pith reviewed 2026-06-29 19:48 UTC · model grok-4.3

classification 💻 cs.AR
keywords approximate nearest neighbor searchprocessing-in-memorygraph-based indexingbillion-scale datasetsalgorithm-architecture co-designhigh-recall searchmemory-bound workloads
0
0 comments X

The pith

Co-design of compacted index layout, pipelined scheduler, and multiplication-free kernel enables efficient graph-based ANNS on PIM at billion scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that graph-based approximate nearest neighbor search, a memory-bound workload central to AI systems, can be adapted to processing-in-memory hardware by addressing its architectural mismatches. It does this via a compacted index that shrinks the memory footprint by 14.5x, an asynchronous pipelined scheduler that saturates the interconnect, and a multiplication-free distance kernel that holds recall loss below 0.08 percent. If these hold, the design delivers substantially higher throughput than CPUs, GPUs, or earlier PIM approaches while scaling to multi-node setups. A sympathetic reader would care because current hardware hits bandwidth walls on billion-scale indexes, and PIM offers internal bandwidth that could change what large-scale search is practical.

Core claim

The central claim is that the three-component co-design overcomes PIM limitations of small local memories, costly inter-unit communication, host coordination overhead, and weak in-memory compute units. The compacted index layout shrinks the PIM-resident footprint by 14.5x. The asynchronous pipelined scheduler keeps the host-to-PIM interconnect saturated. The multiplication-free distance kernel loses under 0.08 percent recall. On three billion-scale benchmarks this yields up to 20x and 17.1x higher throughput than CPU and GPU baselines, 129x over prior PIM accelerators in the high-recall regime, and graceful scaling across multi-node deployments and emerging PIM architectures.

What carries the argument

The algorithm-architecture co-design of compacted index layout, asynchronous pipelined scheduler, and multiplication-free distance kernel.

If this is right

  • Achieves up to 20x higher throughput than CPU baselines on billion-scale benchmarks.
  • Delivers up to 17.1x higher throughput than GPU baselines.
  • Outperforms prior PIM accelerators by 129x in the high-recall regime.
  • Scales gracefully across multi-node deployments and emerging PIM architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The multiplication-free kernel may reduce energy use in other memory-bound similarity tasks beyond ANNS.
  • Future PIM hardware could add native support for irregular graph traversals based on the mismatches identified here.
  • The compacted layout technique might extend to other graph algorithms that suffer from high memory footprint on PIM.

Load-bearing premise

The three components overcome PIM architectural mismatches of small local memory, costly communication, host overhead, and weak compute units without introducing new performance or accuracy bottlenecks.

What would settle it

Running the full design on physical PIM hardware for the three billion-scale benchmarks and measuring whether throughput reaches the reported multiples over CPU, GPU, and prior PIM baselines while recall loss stays below 0.08 percent.

Figures

Figures reproduced from arXiv: 2605.25522 by Amelie Chi Zhou, Jintao Meng, Minwen Deng, Sitian Chen, Yao Chen, Yusen Li.

Figure 1
Figure 1. Figure 1: Roofline Analysis of graph-based ANNS on SIFT. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Query processing with greedy beam search. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Commodity PIM architecture. on optimizing the query execution path, as it dominates end￾to-end ANNS performance, while treating index construction as an offline preprocessing step. By aggressively reducing the cost of distance computation, modern graph-based ANNS algorithms shift their performance bottleneck toward memory access and graph traversal, as confirmed by the roofline analysis in [PITH_FULL_IMAG… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of PIMCQG local memory capacity (C1) forces aggressive graph par￾titioning, which in turn amplifies inter-PU communication during traversal (C2). High communication overhead exacer￾bates coordination costs and load imbalance across PUs (C3), while restricted PU compute capability (C4) further constrains the choice of distance computation and pruning strategies that could otherwise mitigate these e… view at source ↗
Figure 5
Figure 5. Figure 5: Index structure of SymphonyQG and PIMCQG [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of Host-PU collaboration strategy. [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of different scheduling strategies. [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Recall of PIMCQG using node-specific cos(θ) or fixed α on SIFT. value has a convenient binary representation, 1.012, which translates directly into efficient bitwise operations: x · 1.25 ≈ x+(x >> 2). When a specific dataset or cluster deviates from this default, α is calibrated during index construction to the nearest hardware-friendly binary-shift equivalent. As a result, the expensive floating-point sca… view at source ↗
Figure 10
Figure 10. Figure 10: QPS vs. recall@10 of compared baselines. Each point is obtained by varying the search-cluster count and [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Energy efficiency (QPS/W) vs. recall@10 for the compared baselines across three datasets. [PITH_FULL_IMAGE:figures/full_fig_p008_11.png] view at source ↗
Figure 14
Figure 14. Figure 14: Performance breakdown. The overall execution time [PITH_FULL_IMAGE:figures/full_fig_p009_14.png] view at source ↗
Figure 16
Figure 16. Figure 16: Throughput compar￾ison of different scheduling strategies SIFT1B SPACEV1B SSN1B 0.00 0.05 0.10 0.15 Time (s) -60.4% -60.8% -49.6% w/o MF w/ MF [PITH_FULL_IMAGE:figures/full_fig_p010_16.png] view at source ↗
Figure 18
Figure 18. Figure 18: Multi-node scalability of PIMCQG on SIFT1B. UPMEM PIM-HBM AiM 10 0 10 1 10 2 Speedup vs CPU vs GPU [PITH_FULL_IMAGE:figures/full_fig_p010_18.png] view at source ↗
read the original abstract

Approximate Nearest Neighbor Search (ANNS) is a core primitive in modern AI systems, and graph-based methods currently offer the best accuracy-efficiency trade-off at scale. The workload is fundamentally memory-bound: graph traversal produces frequent, irregular memory accesses that cap CPU throughput at main-memory bandwidth, while GPUs lack the high-bandwidth memory capacity to host billion-scale indexes. Processing-in-Memory (PIM) is a natural candidate, as placing computation next to data unlocks the abundant internal bandwidth that such bandwidth-starved workloads demand. Porting graph-based ANNS to PIM, however, exposes several architectural mismatches: each processing unit has only a small local memory, inter-unit communication is costly, host coordination adds overhead, and in-memory compute units are relatively weak -- limitations that have forced prior PIM-based ANNS designs to fall back on cluster-based indexing, whose recall ceiling is far below that of graph methods. This paper presents an algorithm-architecture co-design that overcomes these obstacles through three components: a compacted index layout that shrinks the PIM-resident memory footprint by 14.5x; an asynchronous pipelined scheduler that keeps the host-to-PIM interconnect saturated; and a multiplication-free distance kernel that loses under 0.08% recall. Across three billion-scale benchmarks, the proposed design achieves up to 20x and 17.1x higher throughput than CPU and GPU baselines, respectively, outperforms prior PIM accelerators by 129x in the high-recall regime, and scales gracefully across multi-node deployments and emerging PIM architecture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript presents a co-design for graph-based ANNS on PIM at billion scale. It proposes three components to address PIM architectural mismatches: a compacted index layout that reduces the memory footprint by 14.5x, an asynchronous pipelined scheduler to keep the host-to-PIM interconnect saturated, and a multiplication-free distance kernel that incurs less than 0.08% recall loss. The design is evaluated on three billion-scale benchmarks, claiming up to 20x higher throughput than CPU baselines, 17.1x over GPU, and 129x over prior PIM accelerators in the high-recall regime, with graceful scaling in multi-node and emerging PIM setups.

Significance. If the empirical results are robust, this work is significant as it demonstrates how to adapt high-accuracy graph ANNS to PIM hardware, overcoming limitations that previously forced lower-recall cluster-based methods. The throughput gains are substantial for a memory-bound workload, and the co-design approach could influence future PIM software-hardware co-optimization for irregular access patterns in AI systems. The paper provides concrete numbers and addresses scaling, which strengthens the contribution.

minor comments (1)
  1. [Abstract] Abstract: The abstract states specific quantitative claims (20x, 17.1x, 129x throughput; 14.5x footprint; 0.08% recall) without referencing the corresponding evaluation sections, figures, or tables. The manuscript should cross-reference these claims to the experimental results to facilitate verification.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work on co-designing graph-based ANNS for PIM at billion scale, including recognition of the 14.5x compacted index, asynchronous scheduler, and multiplication-free kernel, as well as the throughput gains and scaling results. The recommendation for minor revision is noted, and we will address any editorial or minor issues in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical hardware-software co-design for graph-based ANNS on PIM hardware. It reports measured throughput and recall numbers from three billion-scale benchmarks without any equations, derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to inputs by construction. The three components (compacted index, pipelined scheduler, multiplication-free kernel) are described as engineering solutions whose performance is validated externally against CPU/GPU baselines and prior PIM designs; none of the load-bearing results are shown to be tautological or self-referential. This is the expected outcome for a systems paper whose contributions are implementation and measurement rather than mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on domain assumptions about PIM hardware properties and graph ANNS workload characteristics; no free parameters or invented entities are identifiable from the abstract.

axioms (1)
  • domain assumption PIM units have only small local memory, costly inter-unit communication, weak compute, and host coordination overhead that prior designs could not overcome with graph methods.
    Explicitly listed in the abstract as the architectural mismatches that forced prior PIM ANNS designs to use lower-recall cluster indexing.

pith-pipeline@v0.9.1-grok · 5824 in / 1353 out tokens · 46205 ms · 2026-06-29T19:48:57.216162+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 50 canonical work pages

  1. [1]

    Similarity search in the blink of an eye with compressed indices,

    C. Aguerrebere, I. Bhati, M. Hildebrand, M. Tepper, and T. Willke, “Similarity search in the blink of an eye with compressed indices,”arXiv preprint arXiv:2304.04759, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2304.04759

  2. [2]

    Fafnir: Accelerating sparse gathering by using efficient near-memory intelligent reduction,

    B. Asgari, R. Hadidi, J. Cao, D. E. Shim, S.-K. Lim, and H. Kim, “Fafnir: Accelerating sparse gathering by using efficient near-memory intelligent reduction,” in2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 908–920. [Online]. Available: https://doi.org/10.1109/HPCA51647. 2021.00080

  3. [3]

    Pimpam: Efficient graph pattern matching on real processing-in-memory hardware,

    S. Cai, B. Tian, H. Zhang, and M. Gao, “Pimpam: Efficient graph pattern matching on real processing-in-memory hardware,”Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–25, 2024. [Online]. Available: https://doi.org/10.1145/3654964

  4. [4]

    Drim-ann: An approximate nearest neighbor search engine based on commercial dram-pims,

    M. Chen, T. Han, C. Liu, S. Liang, K. Yu, L. Dai, Z. Yuan, Y . Wang, L. Zhang, H. Liet al., “Drim-ann: An approximate nearest neighbor search engine based on commercial dram-pims,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025, pp. 820–836. [Online]. Available: https://doi.org/10.1145...

  5. [5]

    Finger: Fast inference for graph-based approximate nearest neighbor search,

    P. Chen, W.-C. Chang, J.-Y . Jiang, H.-F. Yu, I. Dhillon, and C.-J. Hsieh, “Finger: Fast inference for graph-based approximate nearest neighbor search,” inProceedings of the ACM Web Conference 2023, 2023, pp. 3225–3235. [Online]. Available: https://doi.org/10.1145/3543507. 3583318

  6. [6]

    Spann: Highly-efficient billion-scale approximate nearest neighborhood search,

    Q. Chen, B. Zhao, H. Wang, M. Li, C. Liu, Z. Li, M. Yang, and J. Wang, “Spann: Highly-efficient billion-scale approximate nearest neighborhood search,”Advances in Neural Information Processing Systems, vol. 34, pp. 5199–5212, 2021. [Online]. Available: https://doi.org/10.5555/3540261.3540659

  7. [7]

    Approximate nearest neighbor search under neural similarity metric for large-scale recommendation,

    R. Chen, B. Liu, H. Zhu, Y . Wang, Q. Li, B. Ma, Q. Hua, J. Jiang, Y . Xu, H. Denget al., “Approximate nearest neighbor search under neural similarity metric for large-scale recommendation,” in Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 3013–3022. [Online]. Available: https://doi.org/10.1145/35118...

  8. [8]

    Upanns: Enhancing billion-scale anns efficiency with real-world pim architecture,

    S. Chen, A. C. Zhou, Y . Shi, Y . Li, and X. Yao, “Upanns: Enhancing billion-scale anns efficiency with real-world pim architecture,” in SC25: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2025, pp. 1–11. [Online]. Available: https://doi.org/10.1145/3712285.3759777

  9. [9]

    {PIMLex}: A {High-Performance}learned index with{Processing-in-Memory},

    L. Cui, K. Yang, Y . Li, G. Wang, and X. Liu, “{PIMLex}: A {High-Performance}learned index with{Processing-in-Memory},” in 23rd USENIX Conference on File and Storage Technologies (FAST 25), 2025, pp. 287–303. [Online]. Available: https://www.usenix.org/ conference/fast25/presentation/cui

  10. [10]

    The true processing in memory accelerator,

    F. Devaux, “The true processing in memory accelerator,” in2019 IEEE Hot Chips 31 Symposium (HCS). IEEE Computer Society, 2019, pp. 1–24. [Online]. Available: https://doi.org/10.1109/HOTCHIPS.2019. 8875680

  11. [11]

    The journey to a knowledgeable assistant with retrieval-augmented generation (rag),

    X. L. Dong, “The journey to a knowledgeable assistant with retrieval-augmented generation (rag),” inProceedings of the 17th ACM International Conference on Web Search and Data Mining, 2024, pp. 4–4. [Online]. Available: https://doi.org/10.1145/3616855.3638207

  12. [12]

    Facebook SimSearchNet++,

    Facebook, “Facebook SimSearchNet++,” https://dl.fbaipublicfiles.com/ billion-scale-ann-benchmarks/FB ssnpp database.u8bin, 2026

  13. [13]

    Facebook AI Research, “Faiss,” https://github.com/facebookresearch/ faiss

  14. [14]

    High dimensional similarity search with satellite system graph: Efficiency, scalability, and unindexed query compatibility,

    C. Fu, C. Wang, and D. Cai, “High dimensional similarity search with satellite system graph: Efficiency, scalability, and unindexed query compatibility,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4139–4150, 2021. [Online]. Available: https://doi.org/10.1109/TPAMI.2021.3067706

  15. [15]

    Fast approximate nearest neighbor search with the navigating spreading-out graph,

    C. Fu, C. Xiang, C. Wang, and D. Cai, “Fast approximate nearest neighbor search with the navigating spreading-out graph,” arXiv preprint arXiv:1707.00143, 2017. [Online]. Available: https: //doi.org/10.48550/arXiv.1707.00143

  16. [16]

    Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,

    J. Gao and C. Long, “Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,” Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–27, 2024. [Online]. Available: https://doi.org/10.1145/3654970

  17. [17]

    Benchmarking a new paradigm: Experimental analysis and characterization of a real processing-in-memory system,

    J. G ´omez-Luna, I. El Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, “Benchmarking a new paradigm: Experimental analysis and characterization of a real processing-in-memory system,” IEEE Access, vol. 10, pp. 52 565–52 608, 2022. [Online]. Available: https://doi.org/10.1109/ACCESS.2022.3174101

  18. [18]

    idec: indexable distance estimating codes for approximate nearest neighbor search,

    L. Gong, H. Wang, M. Ogihara, and J. Xu, “idec: indexable distance estimating codes for approximate nearest neighbor search,”Proceedings of the VLDB Endowment, vol. 13, no. 9, 2020. [Online]. Available: https://doi.org/10.14778/3397230.3397243

  19. [19]

    Symphonyqg: Towards symphonious integration of quantization and graph for approximate nearest neighbor search,

    Y . Gou, J. Gao, Y . Xu, and C. Long, “Symphonyqg: Towards symphonious integration of quantization and graph for approximate nearest neighbor search,”Proceedings of the ACM on Management of Data, vol. 3, no. 1, pp. 1–26, 2025. [Online]. Available: https://doi.org/10.1145/3709730

  20. [20]

    Ggnn: Graph-based gpu nearest neighbor search,

    F. Groh, L. Ruppert, P. Wieschollek, and H. P. Lensch, “Ggnn: Graph-based gpu nearest neighbor search,”IEEE Transactions on Big Data, vol. 9, no. 01, pp. 267–279, 2023. [Online]. Available: https://doi.org/10.1109/TBDATA.2022.3161156

  21. [21]

    Pim is all you need: A cxl-enabled gpu-free system for large language model inference,

    Y . Gu, A. Khadem, S. Umesh, N. Liang, X. Servot, O. Mutlu, R. Iyer, and R. Das, “Pim is all you need: A cxl-enabled gpu-free system for large language model inference,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2025, pp. 862–881. [Online]. Available: https://...

  22. [22]

    INRIA, “SIFT1B,” http://corpus-texmex.irisa.fr/, 2026

  23. [23]

    {CXL- ANNS}:{Software-Hardware}collaborative memory disaggregation and computation for{Billion-Scale}approximate nearest neighbor search,

    J. Jang, H. Choi, H. Bae, S. Lee, M. Kwon, and M. Jung, “{CXL- ANNS}:{Software-Hardware}collaborative memory disaggregation and computation for{Billion-Scale}approximate nearest neighbor search,” in2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 585–600. [Online]. Available: https://www.usenix.org/ conference/atc23/presentation/jang

  24. [24]

    Diskann: Fast accurate billion-point nearest neighbor search on a single node,

    S. Jayaram Subramanya, F. Devvrit, H. V . Simhadri, R. Krishnawamy, and R. Kadekodi, “Diskann: Fast accurate billion-point nearest neighbor search on a single node,”Advances in neural information processing Systems, vol. 32, 2019. [Online]. Available: https: //dl.acm.org/doi/abs/10.5555/3454287.3455520

  25. [25]

    Co- design hardware and algorithm for vector search,

    W. Jiang, S. Li, Y . Zhu, J. de Fine Licht, Z. He, R. Shi, C. Renggli, S. Zhang, T. Rekatsinas, T. Hoefleret al., “Co- design hardware and algorithm for vector search,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023, pp. 1–15. [Online]. Available: https://doi.org/10.1145/3581784.3607045

  26. [26]

    Near-memory processing in action: Accelerating personalized recommendation with axdimm,

    L. Ke, X. Zhang, J. So, J.-G. Lee, S.-H. Kang, S. Lee, S. Han, Y . Cho, J. H. Kim, Y . Kwonet al., “Near-memory processing in action: Accelerating personalized recommendation with axdimm,” IEEE Micro, vol. 42, no. 1, pp. 116–127, 2021. [Online]. Available: https://doi.org/10.1109/MM.2021.3097700

  27. [27]

    Bang: Billion-scale approximate nearest neighbor search using a single gpu,

    S. Khan, S. Singh, H. V . Simhadri, J. Veduradaet al., “Bang: Billion-scale approximate nearest neighbor search using a single gpu,”arXiv preprint arXiv:2401.11324, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.11324

  28. [28]

    Accelerating large- scale graph-based nearest neighbor search on a computational storage platform,

    J.-H. Kim, Y .-R. Park, J. Do, S.-Y . Ji, and J.-Y . Kim, “Accelerating large- scale graph-based nearest neighbor search on a computational storage platform,”IEEE Transactions on Computers, vol. 72, no. 1, pp. 278–290,

  29. [29]

    Available: https://doi.org/10.1109/TC.2022.3155956

    [Online]. Available: https://doi.org/10.1109/TC.2022.3155956

  30. [30]

    {PathWeaver}: A{High-Throughput}{Multi-GPU}system for{Graph-Based}approximate nearest neighbor search,

    S. Kim, S. Park, S. U. Noh, J. Hong, T. Kwon, H. Lim, and J. Lee, “{PathWeaver}: A{High-Throughput}{Multi-GPU}system for{Graph-Based}approximate nearest neighbor search,” in2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025, pp. 1501–1517. [Online]. Available: https://www.usenix.org/conference/ atc25/presentation/kim

  31. [31]

    Cosmos: A cxl-based full in-memory system for approximate nearest neighbor search,

    S. Ko, H. Shim, W. Doh, S. Yun, J. So, Y . Kwon, S.-S. Park, S.-D. Roh, M. Yoon, T. Songet al., “Cosmos: A cxl-based full in-memory system for approximate nearest neighbor search,” IEEE Computer Architecture Letters, 2025. [Online]. Available: https://doi.org/10.1109/LCA.2025.3570235 11

  32. [32]

    Pimbeam: Efficient regular path queries over graph database using processing-in-memory,

    W. Kong, S. Zheng, Y . Hua, R. Ma, Y . Wen, G. Wang, C. Zhou, and L. Huang, “Pimbeam: Efficient regular path queries over graph database using processing-in-memory,”IEEE Transactions on Parallel and Distributed Systems, 2025. [Online]. Available: https://doi.org/10.1109/TPDS.2025.3547365

  33. [33]

    System architecture and software stack for gddr6-aim,

    Y . Kwon, K. Vladimir, N. Kim, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, G. Kim, B. Anet al., “System architecture and software stack for gddr6-aim,” in2022 IEEE Hot Chips 34 Symposium (HCS). IEEE, 2022, pp. 1–25. [Online]. Available: https://doi.org/10.1109/HCS55958.2022.9895629

  34. [34]

    25.4 a 20nm 6gb function-in-memory dram, based on hbm2 with a 1.2 tflops programmable computing unit using bank-level parallelism, for machine learning applications,

    Y .-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son, O. Seongil, H.-S. Yu, H. Lee, S. Y . Kimet al., “25.4 a 20nm 6gb function-in-memory dram, based on hbm2 with a 1.2 tflops programmable computing unit using bank-level parallelism, for machine learning applications,” in2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64....

  35. [35]

    Cost- effective llm accelerator using processing in memory technology,

    H. Lee, G. Kim, D. Yun, I. Kim, Y . Kwon, and E. Lim, “Cost- effective llm accelerator using processing in memory technology,” in2024 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). IEEE, 2024, pp. 1–2. [Online]. Available: https://doi.org/10.1109/VLSITechnologyandCir46783.2024.10631397

  36. [36]

    Hnswlib - fast approximate nearest neighbor search

    Leonid Boytsov Yury Malkov., “Hnswlib - fast approximate nearest neighbor search.” https://github.com/nmslib/hnswlib

  37. [37]

    Retrieval- augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474,

  38. [38]

    Available: https://proceedings.neurips.cc/paper files/ paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf

    [Online]. Available: https://proceedings.neurips.cc/paper files/ paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf

  39. [39]

    Pim-dl: Expanding the applicability of commodity dram-pims for deep learning via algorithm-system co-optimization,

    C. Li, Z. Zhou, Y . Wang, F. Yang, T. Cao, M. Yang, Y . Liang, and G. Sun, “Pim-dl: Expanding the applicability of commodity dram-pims for deep learning via algorithm-system co-optimization,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 879–896. [Online...

  40. [40]

    Scalable graph indexing using gpus for approximate nearest neighbor search,

    Z. Li, X. Ke, Y . Zhu, B. Yu, B. Zheng, and Y . Gao, “Scalable graph indexing using gpus for approximate nearest neighbor search,” Proceedings of the ACM on Management of Data, vol. 3, no. 6, pp. 1–27, 2025. [Online]. Available: https://doi.org/10.1145/3769825

  41. [41]

    Vstore: in-storage graph based vector search accelerator,

    S. Liang, Y . Wang, Z. Yuan, C. Liu, H. Li, and X. Li, “Vstore: in-storage graph based vector search accelerator,” inProceedings of the 59th ACM/IEEE Design Automation Conference, 2022, pp. 997–1002. [Online]. Available: https://doi.org/10.1145/3489517.3530560

  42. [42]

    Heterrag: Heterogeneous processing-in-memory acceleration for retrieval-augmented generation,

    C. Liu, H. Liu, D. Chen, Y . Huang, Y . Zhang, W. Xiao, X. Liao, and H. Jin, “Heterrag: Heterogeneous processing-in-memory acceleration for retrieval-augmented generation,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 884–898. [Online]. Available: https://doi.org/10.1145/3695053.3731089

  43. [43]

    Accelerating personalized recommendation with cross-level near-memory processing,

    H. Liu, L. Zheng, Y . Huang, C. Liu, X. Ye, J. Yuan, X. Liao, H. Jin, and J. Xue, “Accelerating personalized recommendation with cross-level near-memory processing,” inProceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13. [Online]. Available: https://doi.org/10.1145/3579371.3589101

  44. [44]

    Ansel, E

    Z. Liu, W. Ni, J. Leng, Y . Feng, C. Guo, Q. Chen, C. Li, M. Guo, and Y . Zhu, “Juno: optimizing high-dimensional approximate nearest neighbour search with sparsity-aware algorithm and ray-tracing core mapping,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, ...

  45. [45]

    Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,

    Y . A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,”IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 4, pp. 824–836, 2018. [Online]. Available: https: //doi.org/10.1109/TPAMI.2018.2889473

  46. [46]

    SPACEV1B,

    Microsoft, “SPACEV1B,” https://github.com/microsoft/SPTAG/tree/ main/datasets/SPACEV1B, 2026

  47. [47]

    InIEEE 40th International Conference on Data Engineering

    H. Ootomo, A. Naruse, C. Nolet, R. Wang, T. Feher, and Y . Wang, “Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus,” in2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 2024, pp. 4236–4247. [Online]. Available: https://doi.org/10.1109/icde60146.2024.00323

  48. [48]

    Pim-ai: A novel architecture for high-efficiency llm inference,

    C. Ortega, Y . Falevoz, and R. Ayrignac, “Pim-ai: A novel architecture for high-efficiency llm inference,”arXiv preprint arXiv:2411.17309,

  49. [49]
  50. [50]

    PIM-HBM,

    Samsung, “PIM-HBM,” https://github.com/SAITPublic/PIMSimulator

  51. [51]

    Gpu-native approximate nearest neighbor search with ivf-rabitq: Fast index build and search,

    J. Shi, J. Gao, J. Xia, T. B. Feh ´er, and C. Long, “Gpu-native approximate nearest neighbor search with ivf-rabitq: Fast index build and search,”arXiv preprint arXiv:2602.23999, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2602.23999

  52. [52]

    SK Hynix, “AiM,” https://github.com/arkhadem/aim simulator

  53. [53]

    An efficient fpga implementation of approximate nearest neighbor search,

    Y . Song, C. Liu, R. Zhang, D. Zhu, and Z. Wang, “An efficient fpga implementation of approximate nearest neighbor search,”IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2025. [Online]. Available: https://doi.org/10.1109/TVLSI.2025.3544342

  54. [54]

    Scalable billion-point approximate nearest neighbor search using {SmartSSDs},

    B. Tian, H. Liu, Z. Duan, X. Liao, H. Jin, and Y . Zhang, “Scalable billion-point approximate nearest neighbor search using {SmartSSDs},” in2024 USENIX Annual Technical Conference (USENIX ATC 24), 2024, pp. 1135–1150. [Online]. Available: https://www.usenix.org/conference/atc24/presentation/tian

  55. [55]

    Fusionanns: An efficient cpu/gpu cooperative processing architecture for billion-scale approximate nearest neighbor search,

    B. Tian, H. Liu, Y . Tang, S. Xiao, Z. Duan, X. Liao, X. Zhang, J. Zhu, and Y . Zhang, “Fusionanns: An efficient cpu/gpu cooperative processing architecture for billion-scale approximate nearest neighbor search,”arXiv preprint arXiv:2409.16576, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2409.16576

  56. [56]

    Recommendation of food items for thyroid patients using content-based knn method,

    V . S. Vairale and S. Shukla, “Recommendation of food items for thyroid patients using content-based knn method,” inData Science and Security: Proceedings of IDSCS 2020. Springer, 2020, pp. 71–77. [Online]. Available: https://doi.org/10.1007/978-981-15-5309-7 8

  57. [57]

    Accelerating graph indexing for anns on modern cpus,

    M. Wang, H. Wu, X. Ke, Y . Gao, Y . Zhu, and W. Zhou, “Accelerating graph indexing for anns on modern cpus,”Proceedings of the ACM on Management of Data, vol. 3, no. 3, pp. 1–29, 2025. [Online]. Available: https://doi.org/10.1145/3725260

  58. [58]

    Ems-i: An efficient memory system design with specialized caching mechanism for recommendation inference,

    Y . Wang, S. Li, Q. Zheng, A. Chang, H. Li, and Y . Chen, “Ems-i: An efficient memory system design with specialized caching mechanism for recommendation inference,”ACM Transactions on Embedded Computing Systems, vol. 22, no. 5s, pp. 1–22, 2023. [Online]. Available: https://doi.org/10.1145/3609384

  59. [59]

    Ndsearch: Accelerating graph-traversal-based approximate nearest neighbor search through near data processing,

    Y . Wang, S. Li, Q. Zheng, L. Song, Z. Li, A. Chang, Y . Chen et al., “Ndsearch: Accelerating graph-traversal-based approximate nearest neighbor search through near data processing,” in2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 2024, pp. 368–381. [Online]. Available: https://doi.org/10.1109/ISCA59077.2024.00035

  60. [60]

    Turbocharge{ANNS}on real{Processing-in-Memory} by enabling{Fine-Grained}{Per-PIM-Core}scheduling,

    P. Wu, M. Xie, E. Zhao, D. Zhang, J. Wang, X. Liang, K. Ren, and Y . Chai, “Turbocharge{ANNS}on real{Processing-in-Memory} by enabling{Fine-Grained}{Per-PIM-Core}scheduling,” in2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025, pp. 1223–1241. [Online]. Available: https://www.usenix.org/conference/ atc25/presentation/wu-puqing

  61. [61]

    Proxima: Near-storage acceleration for graph-based approximate nearest neighbor search in 3d nand,

    W. Xu, J. Chen, P.-K. Hsu, J. Kang, M. Zhou, S. Pinge, S. Yu, and T. Rosing, “Proxima: Near-storage acceleration for graph-based approximate nearest neighbor search in 3d nand,” IEEE Transactions on Computers, 2026. [Online]. Available: http: //doi.org/10.1109/tc.2026.3671718

  62. [62]

    Neighborhood Graph and Tree for Indexing High- dimensional Data

    Yahoo Japan, “Neighborhood Graph and Tree for Indexing High- dimensional Data.” https://github.com/yahoojapan/NGT

  63. [63]

    Df-gas: A distributed fpga-as-a-service architecture towards billion-scale graph-based approximate nearest neighbor search,

    S. Zeng, Z. Zhu, J. Liu, H. Zhang, G. Dai, Z. Zhou, S. Li, X. Ning, Y . Xie, H. Yanget al., “Df-gas: A distributed fpga-as-a-service architecture towards billion-scale graph-based approximate nearest neighbor search,” inProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 283–296. [Online]. Available: https://doi...

  64. [64]

    Nl-dpe: An analog in-memory non-linear dot product engine for efficient cnn and llm inference,

    L. Zhao, L. Buonanno, A. Gajjar, J. Moon, A. Natarajan, S. Serebryakov, R. M. Roth, X. Sheng, Y . Zhang, P. Faraboschiet al., “Nl-dpe: An analog in-memory non-linear dot product engine for efficient cnn and llm inference,”arXiv preprint arXiv:2511.13950, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2511.13950

  65. [65]

    Lazylsh: Approximate nearest neighbor search for multiple distance functions with a single index,

    Y . Zheng, Q. Guo, A. K. Tung, and S. Wu, “Lazylsh: Approximate nearest neighbor search for multiple distance functions with a single index,” inProceedings of the 2016 International Conference on Management of Data, 2016, pp. 2023–2037. [Online]. Available: https://doi.org/10.1145/2882903.2882930

  66. [66]

    Processing-in-hierarchical-memory architecture for billion- scale approximate nearest neighbor search,

    Z. Zhu, J. Liu, G. Dai, S. Zeng, B. Li, H. Yang, and Y . Wang, “Processing-in-hierarchical-memory architecture for billion- scale approximate nearest neighbor search,” in2023 60th ACM/IEEE 12 Design Automation Conference (DAC). IEEE, 2023, pp. 1–6. [Online]. Available: https://doi.org/10.1109/DAC56929.2023.10247946

  67. [67]

    Pyglass - Graph Library for Approximate Similarity Search

    Zilliz, “Pyglass - Graph Library for Approximate Similarity Search.” https://github.com/zilliztech/pyglass. 13