Co-Designing Graph-based Approximate Nearest Neighbor Search at Billion Scale for Processing-in-Memory

Amelie Chi Zhou; Jintao Meng; Minwen Deng; Sitian Chen; Yao Chen; Yusen Li

arxiv: 2605.25522 · v1 · pith:GVYTPYMInew · submitted 2026-05-25 · 💻 cs.AR


Pith reviewed 2026-06-29 19:48 UTC · model grok-4.3

classification 💻 cs.AR

keywords approximate nearest neighbor search · processing-in-memory · graph-based indexing · billion-scale datasets · algorithm-architecture co-design · high-recall search · memory-bound workloads


The pith

Co-design of compacted index layout, pipelined scheduler, and multiplication-free kernel enables efficient graph-based ANNS on PIM at billion scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that graph-based approximate nearest neighbor search, a memory-bound workload central to AI systems, can be adapted to processing-in-memory hardware by addressing its architectural mismatches. It does this via a compacted index that shrinks the memory footprint by 14.5x, an asynchronous pipelined scheduler that saturates the interconnect, and a multiplication-free distance kernel that holds recall loss below 0.08 percent. If these hold, the design delivers substantially higher throughput than CPUs, GPUs, or earlier PIM approaches while scaling to multi-node setups. A sympathetic reader would care because current hardware hits bandwidth walls on billion-scale indexes, and PIM offers internal bandwidth that could change what large-scale search is practical.

Core claim

The central claim is that the three-component co-design overcomes PIM limitations of small local memories, costly inter-unit communication, host coordination overhead, and weak in-memory compute units. The compacted index layout shrinks the PIM-resident footprint by 14.5x. The asynchronous pipelined scheduler keeps the host-to-PIM interconnect saturated. The multiplication-free distance kernel loses under 0.08 percent recall. On three billion-scale benchmarks this yields up to 20x and 17.1x higher throughput than CPU and GPU baselines, 129x over prior PIM accelerators in the high-recall regime, and graceful scaling across multi-node deployments and emerging PIM architectures.
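The scheduler's stated purpose, keeping the host-to-PIM interconnect busy, can be illustrated with a generic bounded producer-consumer pipeline. The sketch below is not the paper's scheduler, whose details are not given on this page. The names `run_pipelined`, `transfer`, `compute`, and `depth` are hypothetical stand-ins for "stage a query batch", "run it on the processing units", and "how many batches may be in flight".

```python
import queue
import threading
from typing import Callable, Iterable, List, TypeVar

B = TypeVar("B")  # host-side batch
S = TypeVar("S")  # staged (already transferred) batch
R = TypeVar("R")  # per-batch result

_DONE = object()  # sentinel marking the end of the stream


def run_pipelined(
    batches: Iterable[B],
    transfer: Callable[[B], S],
    compute: Callable[[S], R],
    depth: int = 2,
) -> List[R]:
    """Overlap the transfer of later batches with compute on earlier ones.

    Assumption: `transfer` and `compute` are placeholders for the real
    host-to-device copy and device kernel; `depth` bounds in-flight batches.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    staged: "queue.Queue[object]" = queue.Queue(maxsize=depth)

    def producer() -> None:
        try:
            for batch in batches:
                # Blocks once `depth` staged batches are waiting.
                staged.put(transfer(batch))
        finally:
            # Always signal completion so the consumer cannot hang,
            # even if `transfer` raised part-way through.
            staged.put(_DONE)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results: List[R] = []
    while True:
        item = staged.get()
        if item is _DONE:
            break
        results.append(compute(item))  # FIFO queue keeps batch order
    worker.join()
    return results
```

With `depth=1` the pipeline degenerates to near-synchronous hand-offs. Larger values let transfers run ahead of compute, at the cost of more staged data held in memory.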

What carries the argument

The algorithm-architecture co-design of compacted index layout, asynchronous pipelined scheduler, and multiplication-free distance kernel.

If this is right

Achieves up to 20x higher throughput than CPU baselines on billion-scale benchmarks.
Delivers up to 17.1x higher throughput than GPU baselines.
Outperforms prior PIM accelerators by 129x in the high-recall regime.
Scales gracefully across multi-node deployments and emerging PIM architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The multiplication-free kernel may reduce energy use in other memory-bound similarity tasks beyond ANNS.
Future PIM hardware could add native support for irregular graph traversals based on the mismatches identified here.
The compacted layout technique might extend to other graph algorithms that suffer from high memory footprint on PIM.

Load-bearing premise

The three components overcome PIM architectural mismatches of small local memory, costly communication, host overhead, and weak compute units without introducing new performance or accuracy bottlenecks.
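To make the small-local-memory constraint concrete, the arithmetic below estimates how many processing units an index must be spread across before and after a 14.5x compaction. It is a back-of-envelope sketch: the function names are hypothetical, and the index size and per-unit capacity used in the usage note are illustrative values, not figures from the paper.

```python
import math


def compacted_bytes(index_bytes: int, factor: float = 14.5) -> int:
    """Footprint after shrinking by `factor` (14.5x is the paper's reported ratio)."""
    if index_bytes < 0 or factor <= 0:
        raise ValueError("index_bytes must be >= 0 and factor must be > 0")
    return math.ceil(index_bytes / factor)


def pus_needed(index_bytes: int, per_pu_bytes: int) -> int:
    """Minimum number of processing units whose local memory can hold the index."""
    if index_bytes < 0 or per_pu_bytes <= 0:
        raise ValueError("index_bytes must be >= 0 and per_pu_bytes must be > 0")
    # Integer ceiling division avoids float rounding on large byte counts.
    return (index_bytes + per_pu_bytes - 1) // per_pu_bytes
```

For an illustrative 1 TiB index and 64 MiB of local memory per unit, `pus_needed(2**40, 64 * 2**20)` gives 16384 units, while `pus_needed(compacted_bytes(2**40), 64 * 2**20)` gives 1130. Fewer partitions mean fewer cross-unit hops during traversal, which is the link the premise draws between memory capacity and communication cost.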

What would settle it

Running the full design on physical PIM hardware for the three billion-scale benchmarks and measuring whether throughput reaches the reported multiples over CPU, GPU, and prior PIM baselines while recall loss stays below 0.08 percent.
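A check of this kind needs a recall measurement alongside the throughput numbers. The sketch below computes recall@k from retrieved and ground-truth neighbor lists and expresses the gap between two configurations in percentage points. The function names are hypothetical, and reading the paper's "0.08 percent" as an absolute difference in percentage points is an assumption. The paper may define the loss differently.

```python
from typing import List, Sequence


def recall_at_k(
    retrieved: Sequence[Sequence[int]],
    ground_truth: Sequence[Sequence[int]],
    k: int = 10,
) -> float:
    """Fraction of the true top-k neighbors that appear in the retrieved top-k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(retrieved) != len(ground_truth) or not ground_truth:
        raise ValueError("need one non-empty result list per ground-truth list")
    hits = 0
    for got, truth in zip(retrieved, ground_truth):
        hits += len(set(got[:k]) & set(truth[:k]))
    return hits / (k * len(ground_truth))


def recall_loss_points(baseline: float, candidate: float) -> float:
    """Recall drop from baseline to candidate, in percentage points (assumed metric)."""
    return (baseline - candidate) * 100.0


def within_budget(baseline: float, candidate: float, budget_points: float = 0.08) -> bool:
    """True if the candidate loses less recall than the stated budget."""
    return recall_loss_points(baseline, candidate) < budget_points
```

In a replication, `baseline` would come from the exact-multiplication kernel and `candidate` from the multiplication-free one, both run on the same queries and the same index.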

Figures

Figures reproduced from arXiv: 2605.25522 by Amelie Chi Zhou, Jintao Meng, Minwen Deng, Sitian Chen, Yao Chen, Yusen Li.

**Figure 2.** Query processing with greedy beam search.

**Figure 3.** Commodity PIM architecture.

Excerpt captured with the caption: … on optimizing the query execution path, as it dominates end-to-end ANNS performance, while treating index construction as an offline preprocessing step. By aggressively reducing the cost of distance computation, modern graph-based ANNS algorithms shift their performance bottleneck toward memory access and graph traversal, as confirmed by the roofline analysis in …

**Figure 4.** Overview of PIMCQG.

Excerpt captured with the caption: … local memory capacity (C1) forces aggressive graph partitioning, which in turn amplifies inter-PU communication during traversal (C2). High communication overhead exacerbates coordination costs and load imbalance across PUs (C3), while restricted PU compute capability (C4) further constrains the choice of distance computation and pruning strategies that could otherwise mitigate these e…

**Figure 5.** Index structure of SymphonyQG and PIMCQG.

**Figure 8.** Comparison of Host-PU collaboration strategy.

**Figure 7.** Comparison of different scheduling strategies.

**Figure 9.** Recall of PIMCQG using node-specific cos(θ) or fixed α on SIFT.

Excerpt captured with the caption: … value has a convenient binary representation, $1.01_2$, which translates directly into efficient bitwise operations: x · 1.25 ≈ x + (x >> 2). When a specific dataset or cluster deviates from this default, α is calibrated during index construction to the nearest hardware-friendly binary-shift equivalent. As a result, the expensive floating-point sca…

**Figure 10.** QPS vs. recall@10 of compared baselines. Each point is obtained by varying the search-cluster count and …

**Figure 11.** Energy efficiency (QPS/W) vs. recall@10 for the compared baselines across three datasets.

**Figure 14.** Performance breakdown. The overall execution time …

**Figure 16.** Throughput comparison of different scheduling strategies. Residual plot labels: datasets SIFT1B, SPACEV1B, SSN1B; axis Time (s); series w/o MF and w/ MF; annotations -60.4%, -60.8%, -49.6%.

**Figure 18.** Multi-node scalability of PIMCQG on SIFT1B. Residual plot labels: architectures UPMEM, PIM-HBM, AiM; log-scale axis from 10^0 to 10^2, Speedup; series vs CPU and vs GPU.
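The Figure 9 caption notes that scaling by 1.25 (binary 1.01) can be done with one shift and one add. The sketch below checks that arithmetic on non-negative integers and measures how far truncation moves the result from the exact product. The function names are hypothetical, and this is only an illustration of the identity x · 1.25 ≈ x + (x >> 2), not the paper's distance kernel.

```python
def scale_by_1_25(x: int) -> int:
    """Approximate x * 1.25 with a shift and an add (1.25 is 1.01 in binary)."""
    if x < 0:
        raise ValueError("this sketch assumes non-negative integers")
    # x >> 2 is floor(x / 4), so the result is x + floor(x / 4).
    return x + (x >> 2)


def max_abs_error(limit: int) -> float:
    """Largest gap to the exact product x * 1.25 over 0 <= x < limit."""
    return max(abs(scale_by_1_25(x) - x * 1.25) for x in range(limit))
```

For example, `scale_by_1_25(8)` gives 10 exactly, while `scale_by_1_25(7)` gives 8 against an exact 8.75. The gap never exceeds 0.75, because only the two low bits of `x` are dropped by the shift.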

Original abstract

Approximate Nearest Neighbor Search (ANNS) is a core primitive in modern AI systems, and graph-based methods currently offer the best accuracy-efficiency trade-off at scale. The workload is fundamentally memory-bound: graph traversal produces frequent, irregular memory accesses that cap CPU throughput at main-memory bandwidth, while GPUs lack the high-bandwidth memory capacity to host billion-scale indexes. Processing-in-Memory (PIM) is a natural candidate, as placing computation next to data unlocks the abundant internal bandwidth that such bandwidth-starved workloads demand. Porting graph-based ANNS to PIM, however, exposes several architectural mismatches: each processing unit has only a small local memory, inter-unit communication is costly, host coordination adds overhead, and in-memory compute units are relatively weak -- limitations that have forced prior PIM-based ANNS designs to fall back on cluster-based indexing, whose recall ceiling is far below that of graph methods. This paper presents an algorithm-architecture co-design that overcomes these obstacles through three components: a compacted index layout that shrinks the PIM-resident memory footprint by 14.5x; an asynchronous pipelined scheduler that keeps the host-to-PIM interconnect saturated; and a multiplication-free distance kernel that loses under 0.08% recall. Across three billion-scale benchmarks, the proposed design achieves up to 20x and 17.1x higher throughput than CPU and GPU baselines, respectively, outperforms prior PIM accelerators by 129x in the high-recall regime, and scales gracefully across multi-node deployments and emerging PIM architecture.
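The irregular access pattern the abstract describes comes from the traversal itself. Below is a minimal sketch of greedy beam search over a proximity graph, the query procedure named in the Figure 2 caption. It is a textbook formulation, not the paper's PIM implementation. The names `beam_search`, `beam_width`, and the plain squared-L2 distance are assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Exact squared Euclidean distance (stand-in for any distance kernel)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def beam_search(
    graph: Dict[int, List[int]],
    vectors: Dict[int, Vector],
    query: Vector,
    entry: int,
    beam_width: int,
    k: int,
) -> List[int]:
    """Return up to k node ids, closest first, found by greedy beam search."""
    d0 = squared_l2(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap via negation: best beam_width so far

    while candidates:
        dist, node = heapq.heappop(candidates)
        # Stop once the closest candidate is worse than the worst kept result.
        if len(results) >= beam_width and dist > -results[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            # Each neighbor lookup touches a data-dependent memory location.
            dn = squared_l2(vectors[nb], query)
            if len(results) < beam_width or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > beam_width:
                    heapq.heappop(results)  # evict the current worst

    ordered = sorted((-neg, node) for neg, node in results)
    return [node for _, node in ordered[:k]]
```

Which vector is read next depends on distances computed in the previous step, so the access sequence cannot be known ahead of time. That dependence is why memory behavior, rather than arithmetic, dominates the cost of this kind of traversal.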

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper delivers a concrete three-part co-design that gets graph ANNS running on PIM at billion scale with large reported throughput gains and minimal recall loss.


The main thing to know is that this work shows how to adapt graph-based ANNS to PIM without falling back to lower-recall cluster methods. The three pieces are a compacted index layout that cuts the memory footprint by 14.5x, an asynchronous pipelined scheduler to keep the interconnect busy, and a multiplication-free distance kernel that costs under 0.08% recall.

What stands out is the empirical side. They test on three billion-scale benchmarks and report up to 20x throughput over CPU baselines and 17.1x over GPU, plus 129x over prior PIM designs in the high-recall regime. The multi-node scaling results and checks against emerging PIM architectures add practical value.

The soft spots are mostly in how much detail is visible from the abstract alone. The big speedups are stated clearly, but without the full experimental section it is hard to judge the exact baselines, dataset preparation, or whether the recall numbers hold under identical conditions to the competitors. The multiplication-free kernel is a nice trick, but any hidden accuracy trade-offs at scale would need checking in the paper.

This is the sort of systems paper that people working on memory-bound search or PIM hardware will find useful. It gives a workable path rather than just pointing out the problem. I would send it for peer review because the claims are specific enough to evaluate and the co-design is described at a level that lets referees test the logic.

Referee Report

0 major / 1 minor

Summary. The manuscript presents a co-design for graph-based ANNS on PIM at billion scale. It proposes three components to address PIM architectural mismatches: a compacted index layout that reduces the memory footprint by 14.5x, an asynchronous pipelined scheduler to keep the host-to-PIM interconnect saturated, and a multiplication-free distance kernel that incurs less than 0.08% recall loss. The design is evaluated on three billion-scale benchmarks, claiming up to 20x higher throughput than CPU baselines, 17.1x over GPU, and 129x over prior PIM accelerators in the high-recall regime, with graceful scaling in multi-node and emerging PIM setups.

Significance. If the empirical results are robust, this work is significant as it demonstrates how to adapt high-accuracy graph ANNS to PIM hardware, overcoming limitations that previously forced lower-recall cluster-based methods. The throughput gains are substantial for a memory-bound workload, and the co-design approach could influence future PIM software-hardware co-optimization for irregular access patterns in AI systems. The paper provides concrete numbers and addresses scaling, which strengthens the contribution.

minor comments (1)

[Abstract] The abstract states specific quantitative claims (20x, 17.1x, and 129x throughput; 14.5x footprint reduction; under 0.08% recall loss) without referencing the corresponding evaluation sections, figures, or tables. The manuscript should cross-reference these claims to the experimental results to facilitate verification.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work on co-designing graph-based ANNS for PIM at billion scale, including recognition of the 14.5x compacted index, asynchronous scheduler, and multiplication-free kernel, as well as the throughput gains and scaling results. The recommendation for minor revision is noted, and we will address any editorial or minor issues in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity


The paper presents an empirical hardware-software co-design for graph-based ANNS on PIM hardware. It reports measured throughput and recall numbers from three billion-scale benchmarks without any equations, derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to inputs by construction. The three components (compacted index, pipelined scheduler, multiplication-free kernel) are described as engineering solutions whose performance is validated externally against CPU/GPU baselines and prior PIM designs; none of the load-bearing results are shown to be tautological or self-referential. This is the expected outcome for a systems paper whose contributions are implementation and measurement rather than mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about PIM hardware properties and graph ANNS workload characteristics; no free parameters or invented entities are identifiable from the abstract.

axioms (1)

Domain assumption: PIM units have only small local memory, costly inter-unit communication, weak compute, and host coordination overhead that prior designs could not overcome with graph methods.
Explicitly listed in the abstract as the architectural mismatches that forced prior PIM ANNS designs to use lower-recall cluster indexing.



Reference graph

Works this paper leans on

67 extracted references · 50 canonical work pages

[1]

Similarity search in the blink of an eye with compressed indices,

C. Aguerrebere, I. Bhati, M. Hildebrand, M. Tepper, and T. Willke, “Similarity search in the blink of an eye with compressed indices,”arXiv preprint arXiv:2304.04759, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2304.04759

work page doi:10.48550/arxiv.2304.04759 2023
[2]

Fafnir: Accelerating sparse gathering by using efficient near-memory intelligent reduction,

B. Asgari, R. Hadidi, J. Cao, D. E. Shim, S.-K. Lim, and H. Kim, “Fafnir: Accelerating sparse gathering by using efficient near-memory intelligent reduction,” in2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 908–920. [Online]. Available: https://doi.org/10.1109/HPCA51647. 2021.00080

work page doi:10.1109/hpca51647 2021
[3]

Pimpam: Efficient graph pattern matching on real processing-in-memory hardware,

S. Cai, B. Tian, H. Zhang, and M. Gao, “Pimpam: Efficient graph pattern matching on real processing-in-memory hardware,”Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–25, 2024. [Online]. Available: https://doi.org/10.1145/3654964

work page doi:10.1145/3654964 2024
[4]

Drim-ann: An approximate nearest neighbor search engine based on commercial dram-pims,

M. Chen, T. Han, C. Liu, S. Liang, K. Yu, L. Dai, Z. Yuan, Y . Wang, L. Zhang, H. Liet al., “Drim-ann: An approximate nearest neighbor search engine based on commercial dram-pims,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025, pp. 820–836. [Online]. Available: https://doi.org/10.1145...

work page doi:10.1145/3712285.3759801 2025
[5]

Finger: Fast inference for graph-based approximate nearest neighbor search,

P. Chen, W.-C. Chang, J.-Y . Jiang, H.-F. Yu, I. Dhillon, and C.-J. Hsieh, “Finger: Fast inference for graph-based approximate nearest neighbor search,” inProceedings of the ACM Web Conference 2023, 2023, pp. 3225–3235. [Online]. Available: https://doi.org/10.1145/3543507. 3583318

work page doi:10.1145/3543507 2023
[6]

Spann: Highly-efficient billion-scale approximate nearest neighborhood search,

Q. Chen, B. Zhao, H. Wang, M. Li, C. Liu, Z. Li, M. Yang, and J. Wang, “Spann: Highly-efficient billion-scale approximate nearest neighborhood search,”Advances in Neural Information Processing Systems, vol. 34, pp. 5199–5212, 2021. [Online]. Available: https://doi.org/10.5555/3540261.3540659

work page doi:10.5555/3540261.3540659 2021
[7]

Approximate nearest neighbor search under neural similarity metric for large-scale recommendation,

R. Chen, B. Liu, H. Zhu, Y . Wang, Q. Li, B. Ma, Q. Hua, J. Jiang, Y . Xu, H. Denget al., “Approximate nearest neighbor search under neural similarity metric for large-scale recommendation,” in Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 3013–3022. [Online]. Available: https://doi.org/10.1145/35118...

work page doi:10.1145/3511808.3557098 2022
[8]

Upanns: Enhancing billion-scale anns efficiency with real-world pim architecture,

S. Chen, A. C. Zhou, Y . Shi, Y . Li, and X. Yao, “Upanns: Enhancing billion-scale anns efficiency with real-world pim architecture,” in SC25: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2025, pp. 1–11. [Online]. Available: https://doi.org/10.1145/3712285.3759777

work page doi:10.1145/3712285.3759777 2025
[9]

{PIMLex}: A {High-Performance}learned index with{Processing-in-Memory},

L. Cui, K. Yang, Y . Li, G. Wang, and X. Liu, “{PIMLex}: A {High-Performance}learned index with{Processing-in-Memory},” in 23rd USENIX Conference on File and Storage Technologies (FAST 25), 2025, pp. 287–303. [Online]. Available: https://www.usenix.org/ conference/fast25/presentation/cui

2025
[10]

The true processing in memory accelerator,

F. Devaux, “The true processing in memory accelerator,” in2019 IEEE Hot Chips 31 Symposium (HCS). IEEE Computer Society, 2019, pp. 1–24. [Online]. Available: https://doi.org/10.1109/HOTCHIPS.2019. 8875680

work page doi:10.1109/hotchips.2019 2019
[11]

The journey to a knowledgeable assistant with retrieval-augmented generation (rag),

X. L. Dong, “The journey to a knowledgeable assistant with retrieval-augmented generation (rag),” inProceedings of the 17th ACM International Conference on Web Search and Data Mining, 2024, pp. 4–4. [Online]. Available: https://doi.org/10.1145/3616855.3638207

work page doi:10.1145/3616855.3638207 2024
[12]

Facebook SimSearchNet++,

Facebook, “Facebook SimSearchNet++,” https://dl.fbaipublicfiles.com/ billion-scale-ann-benchmarks/FB ssnpp database.u8bin, 2026

2026
[13]

Facebook AI Research, “Faiss,” https://github.com/facebookresearch/ faiss
[14]

High dimensional similarity search with satellite system graph: Efficiency, scalability, and unindexed query compatibility,

C. Fu, C. Wang, and D. Cai, “High dimensional similarity search with satellite system graph: Efficiency, scalability, and unindexed query compatibility,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4139–4150, 2021. [Online]. Available: https://doi.org/10.1109/TPAMI.2021.3067706

work page doi:10.1109/tpami.2021.3067706 2021
[15]

Fast approximate nearest neighbor search with the navigating spreading-out graph,

C. Fu, C. Xiang, C. Wang, and D. Cai, “Fast approximate nearest neighbor search with the navigating spreading-out graph,” arXiv preprint arXiv:1707.00143, 2017. [Online]. Available: https: //doi.org/10.48550/arXiv.1707.00143

work page doi:10.48550/arxiv.1707.00143 2017
[16]

Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,

J. Gao and C. Long, “Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,” Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–27, 2024. [Online]. Available: https://doi.org/10.1145/3654970

work page doi:10.1145/3654970 2024
[17]

Benchmarking a new paradigm: Experimental analysis and characterization of a real processing-in-memory system,

J. G ´omez-Luna, I. El Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, “Benchmarking a new paradigm: Experimental analysis and characterization of a real processing-in-memory system,” IEEE Access, vol. 10, pp. 52 565–52 608, 2022. [Online]. Available: https://doi.org/10.1109/ACCESS.2022.3174101

work page doi:10.1109/access.2022.3174101 2022
[18]

idec: indexable distance estimating codes for approximate nearest neighbor search,

L. Gong, H. Wang, M. Ogihara, and J. Xu, “idec: indexable distance estimating codes for approximate nearest neighbor search,”Proceedings of the VLDB Endowment, vol. 13, no. 9, 2020. [Online]. Available: https://doi.org/10.14778/3397230.3397243

work page doi:10.14778/3397230.3397243 2020
[19]

Symphonyqg: Towards symphonious integration of quantization and graph for approximate nearest neighbor search,

Y . Gou, J. Gao, Y . Xu, and C. Long, “Symphonyqg: Towards symphonious integration of quantization and graph for approximate nearest neighbor search,”Proceedings of the ACM on Management of Data, vol. 3, no. 1, pp. 1–26, 2025. [Online]. Available: https://doi.org/10.1145/3709730

work page doi:10.1145/3709730 2025
[20]

Ggnn: Graph-based gpu nearest neighbor search,

F. Groh, L. Ruppert, P. Wieschollek, and H. P. Lensch, “Ggnn: Graph-based gpu nearest neighbor search,”IEEE Transactions on Big Data, vol. 9, no. 01, pp. 267–279, 2023. [Online]. Available: https://doi.org/10.1109/TBDATA.2022.3161156

work page doi:10.1109/tbdata.2022.3161156 2023
[21]

Pim is all you need: A cxl-enabled gpu-free system for large language model inference,

Y . Gu, A. Khadem, S. Umesh, N. Liang, X. Servot, O. Mutlu, R. Iyer, and R. Das, “Pim is all you need: A cxl-enabled gpu-free system for large language model inference,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2025, pp. 862–881. [Online]. Available: https://...

work page doi:10.1145/3676641.3716267 2025
[22]

INRIA, “SIFT1B,” http://corpus-texmex.irisa.fr/, 2026

2026
[23]

{CXL- ANNS}:{Software-Hardware}collaborative memory disaggregation and computation for{Billion-Scale}approximate nearest neighbor search,

J. Jang, H. Choi, H. Bae, S. Lee, M. Kwon, and M. Jung, “{CXL- ANNS}:{Software-Hardware}collaborative memory disaggregation and computation for{Billion-Scale}approximate nearest neighbor search,” in2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 585–600. [Online]. Available: https://www.usenix.org/ conference/atc23/presentation/jang

2023
[24]

Diskann: Fast accurate billion-point nearest neighbor search on a single node,

S. Jayaram Subramanya, F. Devvrit, H. V . Simhadri, R. Krishnawamy, and R. Kadekodi, “Diskann: Fast accurate billion-point nearest neighbor search on a single node,”Advances in neural information processing Systems, vol. 32, 2019. [Online]. Available: https: //dl.acm.org/doi/abs/10.5555/3454287.3455520

work page doi:10.5555/3454287.3455520 2019
[25]

Co- design hardware and algorithm for vector search,

W. Jiang, S. Li, Y . Zhu, J. de Fine Licht, Z. He, R. Shi, C. Renggli, S. Zhang, T. Rekatsinas, T. Hoefleret al., “Co- design hardware and algorithm for vector search,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023, pp. 1–15. [Online]. Available: https://doi.org/10.1145/3581784.3607045

work page doi:10.1145/3581784.3607045 2023
[26]

Near-memory processing in action: Accelerating personalized recommendation with axdimm,

L. Ke, X. Zhang, J. So, J.-G. Lee, S.-H. Kang, S. Lee, S. Han, Y . Cho, J. H. Kim, Y . Kwonet al., “Near-memory processing in action: Accelerating personalized recommendation with axdimm,” IEEE Micro, vol. 42, no. 1, pp. 116–127, 2021. [Online]. Available: https://doi.org/10.1109/MM.2021.3097700

work page doi:10.1109/mm.2021.3097700 2021
[27]

Bang: Billion-scale approximate nearest neighbor search using a single gpu,

S. Khan, S. Singh, H. V . Simhadri, J. Veduradaet al., “Bang: Billion-scale approximate nearest neighbor search using a single gpu,”arXiv preprint arXiv:2401.11324, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.11324

work page doi:10.48550/arxiv.2401.11324 2024
[28]

Accelerating large- scale graph-based nearest neighbor search on a computational storage platform,

J.-H. Kim, Y .-R. Park, J. Do, S.-Y . Ji, and J.-Y . Kim, “Accelerating large- scale graph-based nearest neighbor search on a computational storage platform,”IEEE Transactions on Computers, vol. 72, no. 1, pp. 278–290,
[29]

Available: https://doi.org/10.1109/TC.2022.3155956

[Online]. Available: https://doi.org/10.1109/TC.2022.3155956

work page doi:10.1109/tc.2022.3155956 2022
[30]

{PathWeaver}: A{High-Throughput}{Multi-GPU}system for{Graph-Based}approximate nearest neighbor search,

S. Kim, S. Park, S. U. Noh, J. Hong, T. Kwon, H. Lim, and J. Lee, “{PathWeaver}: A{High-Throughput}{Multi-GPU}system for{Graph-Based}approximate nearest neighbor search,” in2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025, pp. 1501–1517. [Online]. Available: https://www.usenix.org/conference/ atc25/presentation/kim

2025
[31]

Cosmos: A cxl-based full in-memory system for approximate nearest neighbor search,

S. Ko, H. Shim, W. Doh, S. Yun, J. So, Y . Kwon, S.-S. Park, S.-D. Roh, M. Yoon, T. Songet al., “Cosmos: A cxl-based full in-memory system for approximate nearest neighbor search,” IEEE Computer Architecture Letters, 2025. [Online]. Available: https://doi.org/10.1109/LCA.2025.3570235 11

work page doi:10.1109/lca.2025.3570235 2025
[32]

Pimbeam: Efficient regular path queries over graph database using processing-in-memory,

W. Kong, S. Zheng, Y . Hua, R. Ma, Y . Wen, G. Wang, C. Zhou, and L. Huang, “Pimbeam: Efficient regular path queries over graph database using processing-in-memory,”IEEE Transactions on Parallel and Distributed Systems, 2025. [Online]. Available: https://doi.org/10.1109/TPDS.2025.3547365

work page doi:10.1109/tpds.2025.3547365 2025
[33]

System architecture and software stack for gddr6-aim,

Y . Kwon, K. Vladimir, N. Kim, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, G. Kim, B. Anet al., “System architecture and software stack for gddr6-aim,” in2022 IEEE Hot Chips 34 Symposium (HCS). IEEE, 2022, pp. 1–25. [Online]. Available: https://doi.org/10.1109/HCS55958.2022.9895629

work page doi:10.1109/hcs55958.2022.9895629 2022
[34]

25.4 a 20nm 6gb function-in-memory dram, based on hbm2 with a 1.2 tflops programmable computing unit using bank-level parallelism, for machine learning applications,

Y .-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son, O. Seongil, H.-S. Yu, H. Lee, S. Y . Kimet al., “25.4 a 20nm 6gb function-in-memory dram, based on hbm2 with a 1.2 tflops programmable computing unit using bank-level parallelism, for machine learning applications,” in2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64....

work page doi:10.1109/isscc42613.2021.9365862 2021
[35]

Cost- effective llm accelerator using processing in memory technology,

H. Lee, G. Kim, D. Yun, I. Kim, Y . Kwon, and E. Lim, “Cost- effective llm accelerator using processing in memory technology,” in2024 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). IEEE, 2024, pp. 1–2. [Online]. Available: https://doi.org/10.1109/VLSITechnologyandCir46783.2024.10631397

work page doi:10.1109/vlsitechnologyandcir46783.2024.10631397 2024
[36]

Hnswlib - fast approximate nearest neighbor search

Leonid Boytsov Yury Malkov., “Hnswlib - fast approximate nearest neighbor search.” https://github.com/nmslib/hnswlib
[37]

Retrieval- augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474,
[38]

Available: https://proceedings.neurips.cc/paper files/ paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf

[Online]. Available: https://proceedings.neurips.cc/paper files/ paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf

2020
[39]

Pim-dl: Expanding the applicability of commodity dram-pims for deep learning via algorithm-system co-optimization,

C. Li, Z. Zhou, Y . Wang, F. Yang, T. Cao, M. Yang, Y . Liang, and G. Sun, “Pim-dl: Expanding the applicability of commodity dram-pims for deep learning via algorithm-system co-optimization,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 879–896. [Online...

work page doi:10.1145/3620665.3640376 2024
[40]

Scalable graph indexing using gpus for approximate nearest neighbor search,

Z. Li, X. Ke, Y . Zhu, B. Yu, B. Zheng, and Y . Gao, “Scalable graph indexing using gpus for approximate nearest neighbor search,” Proceedings of the ACM on Management of Data, vol. 3, no. 6, pp. 1–27, 2025. [Online]. Available: https://doi.org/10.1145/3769825

work page doi:10.1145/3769825 2025
[41]

Vstore: in-storage graph based vector search accelerator,

S. Liang, Y . Wang, Z. Yuan, C. Liu, H. Li, and X. Li, “Vstore: in-storage graph based vector search accelerator,” inProceedings of the 59th ACM/IEEE Design Automation Conference, 2022, pp. 997–1002. [Online]. Available: https://doi.org/10.1145/3489517.3530560

work page doi:10.1145/3489517.3530560 2022
[42]

Heterrag: Heterogeneous processing-in-memory acceleration for retrieval-augmented generation,

C. Liu, H. Liu, D. Chen, Y . Huang, Y . Zhang, W. Xiao, X. Liao, and H. Jin, “Heterrag: Heterogeneous processing-in-memory acceleration for retrieval-augmented generation,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 884–898. [Online]. Available: https://doi.org/10.1145/3695053.3731089

work page doi:10.1145/3695053.3731089 2025
[43]

Accelerating personalized recommendation with cross-level near-memory processing,

H. Liu, L. Zheng, Y . Huang, C. Liu, X. Ye, J. Yuan, X. Liao, H. Jin, and J. Xue, “Accelerating personalized recommendation with cross-level near-memory processing,” inProceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13. [Online]. Available: https://doi.org/10.1145/3579371.3589101

work page doi:10.1145/3579371.3589101 2023
[44]

Ansel, E

Z. Liu, W. Ni, J. Leng, Y . Feng, C. Guo, Q. Chen, C. Li, M. Guo, and Y . Zhu, “Juno: optimizing high-dimensional approximate nearest neighbour search with sparsity-aware algorithm and ray-tracing core mapping,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, ...

work page doi:10.1145/3620665.3640360 2024
[45]

Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,

Y . A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,”IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 4, pp. 824–836, 2018. [Online]. Available: https: //doi.org/10.1109/TPAMI.2018.2889473

work page doi:10.1109/tpami.2018.2889473 2018
[46]

SPACEV1B,

Microsoft, “SPACEV1B,” https://github.com/microsoft/SPTAG/tree/ main/datasets/SPACEV1B, 2026

2026
[47]

InIEEE 40th International Conference on Data Engineering

H. Ootomo, A. Naruse, C. Nolet, R. Wang, T. Feher, and Y . Wang, “Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus,” in2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 2024, pp. 4236–4247. [Online]. Available: https://doi.org/10.1109/icde60146.2024.00323

[48] C. Ortega, Y. Falevoz, and R. Ayrignac, “Pim-ai: A novel architecture for high-efficiency llm inference,” arXiv preprint arXiv:2411.17309, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2411.17309

[50] Samsung, “PIM-HBM,” https://github.com/SAITPublic/PIMSimulator

[51] J. Shi, J. Gao, J. Xia, T. B. Fehér, and C. Long, “Gpu-native approximate nearest neighbor search with ivf-rabitq: Fast index build and search,” arXiv preprint arXiv:2602.23999, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2602.23999

[52] SK Hynix, “AiM,” https://github.com/arkhadem/aim_simulator

[53] Y. Song, C. Liu, R. Zhang, D. Zhu, and Z. Wang, “An efficient fpga implementation of approximate nearest neighbor search,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2025. [Online]. Available: https://doi.org/10.1109/TVLSI.2025.3544342

[54] B. Tian, H. Liu, Z. Duan, X. Liao, H. Jin, and Y. Zhang, “Scalable billion-point approximate nearest neighbor search using SmartSSDs,” in 2024 USENIX Annual Technical Conference (USENIX ATC 24), 2024, pp. 1135–1150. [Online]. Available: https://www.usenix.org/conference/atc24/presentation/tian

[55] B. Tian, H. Liu, Y. Tang, S. Xiao, Z. Duan, X. Liao, X. Zhang, J. Zhu, and Y. Zhang, “Fusionanns: An efficient cpu/gpu cooperative processing architecture for billion-scale approximate nearest neighbor search,” arXiv preprint arXiv:2409.16576, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2409.16576

[56] V. S. Vairale and S. Shukla, “Recommendation of food items for thyroid patients using content-based knn method,” in Data Science and Security: Proceedings of IDSCS 2020. Springer, 2020, pp. 71–77. [Online]. Available: https://doi.org/10.1007/978-981-15-5309-7_8

[57] M. Wang, H. Wu, X. Ke, Y. Gao, Y. Zhu, and W. Zhou, “Accelerating graph indexing for anns on modern cpus,” Proceedings of the ACM on Management of Data, vol. 3, no. 3, pp. 1–29, 2025. [Online]. Available: https://doi.org/10.1145/3725260

[58] Y. Wang, S. Li, Q. Zheng, A. Chang, H. Li, and Y. Chen, “Ems-i: An efficient memory system design with specialized caching mechanism for recommendation inference,” ACM Transactions on Embedded Computing Systems, vol. 22, no. 5s, pp. 1–22, 2023. [Online]. Available: https://doi.org/10.1145/3609384

[59] Y. Wang, S. Li, Q. Zheng, L. Song, Z. Li, A. Chang, Y. Chen et al., “Ndsearch: Accelerating graph-traversal-based approximate nearest neighbor search through near data processing,” in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 2024, pp. 368–381. [Online]. Available: https://doi.org/10.1109/ISCA59077.2024.00035

[60] P. Wu, M. Xie, E. Zhao, D. Zhang, J. Wang, X. Liang, K. Ren, and Y. Chai, “Turbocharge ANNS on real Processing-in-Memory by enabling Fine-Grained Per-PIM-Core scheduling,” in 2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025, pp. 1223–1241. [Online]. Available: https://www.usenix.org/conference/atc25/presentation/wu-puqing

[61] W. Xu, J. Chen, P.-K. Hsu, J. Kang, M. Zhou, S. Pinge, S. Yu, and T. Rosing, “Proxima: Near-storage acceleration for graph-based approximate nearest neighbor search in 3d nand,” IEEE Transactions on Computers, 2026. [Online]. Available: http://doi.org/10.1109/tc.2026.3671718

[62] Yahoo Japan, “Neighborhood Graph and Tree for Indexing High-dimensional Data.” https://github.com/yahoojapan/NGT

[63] S. Zeng, Z. Zhu, J. Liu, H. Zhang, G. Dai, Z. Zhou, S. Li, X. Ning, Y. Xie, H. Yang et al., “Df-gas: A distributed fpga-as-a-service architecture towards billion-scale graph-based approximate nearest neighbor search,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 283–296. [Online]. Available: https://doi.org/10.1145/3613424.3614292

[64] L. Zhao, L. Buonanno, A. Gajjar, J. Moon, A. Natarajan, S. Serebryakov, R. M. Roth, X. Sheng, Y. Zhang, P. Faraboschi et al., “Nl-dpe: An analog in-memory non-linear dot product engine for efficient cnn and llm inference,” arXiv preprint arXiv:2511.13950, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2511.13950

[65] Y. Zheng, Q. Guo, A. K. Tung, and S. Wu, “Lazylsh: Approximate nearest neighbor search for multiple distance functions with a single index,” in Proceedings of the 2016 International Conference on Management of Data, 2016, pp. 2023–2037. [Online]. Available: https://doi.org/10.1145/2882903.2882930

[66] Z. Zhu, J. Liu, G. Dai, S. Zeng, B. Li, H. Yang, and Y. Wang, “Processing-in-hierarchical-memory architecture for billion-scale approximate nearest neighbor search,” in 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 2023, pp. 1–6. [Online]. Available: https://doi.org/10.1109/DAC56929.2023.10247946

[67] Zilliz, “Pyglass - Graph Library for Approximate Similarity Search.” https://github.com/zilliztech/pyglass
