pith. sign in

arxiv: 2511.03298 · v2 · submitted 2025-11-05 · 💻 cs.IR

KScaNN: Scalable Approximate Nearest Neighbor Search on Kunpeng

Pith reviewed 2026-05-18 01:29 UTC · model grok-4.3

classification 💻 cs.IR
keywords approximate nearest neighbor searchARM architectureKunpeng processorvector searchSIMD optimizationproduct quantizationadaptive searchinformation retrieval
0
0 comments X

The pith

KScaNN redesigns approximate nearest neighbor search for Kunpeng ARM servers and achieves up to 1.63 times the speed of the best x86 solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that simply moving existing x86 ANNS code to ARM hardware leaves large speedups on the table because those methods ignore the processor's particular strengths. KScaNN addresses this by pairing higher-level changes such as a hybrid intra-cluster search and better product-quantization residual handling with an ML-based module that tunes parameters on the fly for each query. It further supplies custom SIMD kernels written specifically for the Kunpeng 920 to accelerate the distance calculations that dominate runtime. If these changes deliver the reported gains, vector-search workloads in recommendation systems and retrieval applications can run faster and at lower cost on the growing fleet of ARM servers.

Core claim

KScaNN is a new approximate nearest neighbor search method built from the start for the Kunpeng 920 ARM processor. It introduces a hybrid intra-cluster search strategy together with an improved residual calculation inside product quantization, an ML-driven adaptive search module that chooses parameters per query, and hand-tuned SIMD kernels that make full use of the ARM hardware for distance computations. These pieces together close the performance gap with x86 systems and deliver up to a 1.63x speedup over the fastest existing x86-based solutions.

What carries the argument

A co-designed stack of hybrid intra-cluster search, improved PQ residual calculation, ML-driven per-query adaptive tuning, and Kunpeng-specific SIMD kernels that jointly optimize the algorithm and its low-level execution on ARM hardware.

If this is right

  • Production recommendation and retrieval systems on Kunpeng servers can sustain higher query rates without extra hardware.
  • The adaptive ML module removes the need for static parameter tuning that often wastes compute in live deployments.
  • ARM vector units can be driven to high utilization in distance-heavy workloads when kernels are written for the specific microarchitecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other vector-similarity tasks beyond ANNS may see similar gains if the same adaptive and SIMD co-design pattern is applied on Kunpeng.
  • Cloud operators running mixed ARM and x86 fleets could route vector workloads preferentially to Kunpeng once architecture-specific libraries are available.
  • The results suggest that portable ANNS libraries will need explicit architecture back-ends rather than relying on generic ports for peak performance.

Load-bearing premise

That a direct port of x86 ANNS algorithms to ARM creates a substantial performance deficit that only a hardware-aware redesign can overcome.

What would settle it

Side-by-side timing and recall measurements on the same Kunpeng 920 hardware where a carefully tuned x86 port matches or exceeds KScaNN query latency at equivalent accuracy.

Figures

Figures reproduced from arXiv: 2511.03298 by Alexander Radionov, Daihao Xue, Dmitriy Malyshev, Gleb Neshchetkin, Jan Tabaszewski, Licheng Yu, Meiling Wang, Oleg Senkevich, Qiuling Pan, Siyang Xu, Siyu Huang, Tianyi Jiang, Weidi Zeng, Xin Yao, Yaoyao Fu, Zijian Li.

Figure 1
Figure 1. Figure 1: The pipeline of KScaNN algorithm is illustrated in [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Proportion of queries with 10 NN found for recall 0.99 for SIFT1M [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of the top-100 nearest neighbors across the most [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of cluster pruning scenarios. FP: A cluster whose centroid [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Sample images from the FASHION-MNIST dataset, characterized [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The hybrid search strategy, which adaptively combines graph [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: The optimized data layout for the NEON LUT16 distance computation. [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of the residual vectors, computed with respect to a [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Ablation study showing cumulative QPS improvement as KScaNN features are incrementally enabled. All results are at Recall@10=0.99. For each [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Performance of the hybrid graph-based search on SIFT1M, showing [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Pruning points in a cluster using the triangle inequality. Points outside [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗
read the original abstract

Approximate Nearest Neighbor Search (ANNS) is a cornerstone algorithm for information retrieval, recommendation systems, and machine learning applications. While x86-based architectures have historically dominated this domain, the increasing adoption of ARM-based servers in industry presents a critical need for ANNS solutions optimized on ARM architectures. A naive port of existing x86 ANNS algorithms to ARM platforms results in a substantial performance deficit, failing to leverage the unique capabilities of the underlying hardware. To address this challenge, we introduce KScaNN, a novel ANNS algorithm co-designed for the Kunpeng 920 ARM architecture. KScaNN embodies a holistic approach that synergizes sophisticated, data aware algorithmic refinements with carefully-designed hardware specific optimizations. Its core contributions include: 1) novel algorithmic techniques, including a hybrid intra-cluster search strategy and an improved PQ residual calculation method, which optimize the search process at a higher level; 2) an ML-driven adaptive search module that provides adaptive, per-query tuning of search parameters, eliminating the inefficiencies of static configurations; and 3) highly-optimized SIMD kernels for ARM that maximize hardware utilization for the critical distance computation workloads. The experimental results demonstrate that KScaNN not only closes the performance gap but establishes a new standard, achieving up to a 1.63x speedup over the fastest x86-based solution. This work provides a definitive blueprint for achieving leadership-class performance for vector search on modern ARM architectures and underscores

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces KScaNN, an approximate nearest neighbor search algorithm co-designed for the Kunpeng 920 ARM architecture. It combines a hybrid intra-cluster search strategy, an improved product quantization residual calculation, an ML-driven adaptive module for per-query parameter tuning, and ARM-specific SIMD kernels. The central experimental claim is that these techniques yield up to a 1.63x speedup over the fastest x86-based ANNS solutions.

Significance. If the performance claims are supported by fair, well-documented comparisons, the work is significant for information retrieval and vector search systems. It provides concrete hardware-aware optimizations for ARM servers, which are seeing increased industrial adoption, and demonstrates the value of combining algorithmic refinements with low-level SIMD work and adaptive ML control. The ML adaptive module in particular offers a reusable idea for reducing static configuration overhead.

major comments (1)
  1. [§5] §5 (Experimental Evaluation) and the associated tables: the 1.63x speedup claim over 'the fastest x86-based solution' is the central result, yet the manuscript supplies no explicit description of the x86 hardware (CPU model, core count, memory configuration) or the optimization effort applied to the ScaNN and Faiss baselines. Without this information it is impossible to determine whether the observed advantage arises from the proposed co-design or from differences in the underlying platforms and tuning parity.
minor comments (2)
  1. [Abstract] The abstract ends abruptly ('underscores').
  2. [§3] Notation for the adaptive module parameters could be introduced earlier and used consistently in the algorithmic description.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The feedback on the experimental section is particularly helpful for strengthening the clarity and reproducibility of our performance claims. We address the major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§5] §5 (Experimental Evaluation) and the associated tables: the 1.63x speedup claim over 'the fastest x86-based solution' is the central result, yet the manuscript supplies no explicit description of the x86 hardware (CPU model, core count, memory configuration) or the optimization effort applied to the ScaNN and Faiss baselines. Without this information it is impossible to determine whether the observed advantage arises from the proposed co-design or from differences in the underlying platforms and tuning parity.

    Authors: We agree that the current manuscript does not provide sufficient detail on the x86 evaluation platform and baseline tuning. In the revised version we will add a dedicated paragraph (or short subsection) in §5 that explicitly states the x86 server configuration (CPU model, core count, memory capacity and type, and operating system) together with the precise optimization steps applied to the ScaNN and Faiss baselines, including compiler flags, recommended parameter settings from their official repositories, and any architecture-specific adjustments we performed to ensure a fair comparison. This addition will allow readers to assess whether the reported 1.63× advantage stems from our co-design or from platform differences. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical benchmarks

full rationale

The paper presents a co-designed ANNS implementation with algorithmic refinements (hybrid intra-cluster search, improved PQ residuals, ML adaptive module) and ARM-specific SIMD kernels. Performance claims are established via direct experimental comparison to x86 baselines rather than any mathematical derivation, fitted-parameter prediction, or self-referential equation. No load-bearing step reduces to its own inputs by construction, and no uniqueness theorem or ansatz is imported via self-citation. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view reveals no explicit free parameters, mathematical axioms, or newly postulated entities; the contribution is framed as engineering co-design of existing ANNS building blocks with hardware-specific code.

pith-pipeline@v0.9.0 · 5847 in / 1268 out tokens · 43070 ms · 2026-05-18T01:29:53.081856+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    WebANNS: Fast and efficient approximate nearest neigh- bor search in web browsers,

    M. Liu et al., “WebANNS: Fast and efficient approximate nearest neigh- bor search in web browsers,” InProc. 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’25), Padua, Italy, July 2025, pp. 2483–2492

  2. [2]

    AiSAQ: All-in-storage ANNS with product quanti- zation for DRAM-free information retrieval,

    K. Tatsuno et al., “AiSAQ: All-in-storage ANNS with product quanti- zation for DRAM-free information retrieval,” 2024, arXiv:2404.06004. [Online]. Available: https://arxiv.org/abs/2404.06004

  3. [3]

    Random grids: Fast ap- proximate nearest neighbors and range searching for image search,

    D. Aiger, E. Kokiopoulou, and E. Rivlin, “Random grids: Fast ap- proximate nearest neighbors and range searching for image search,” In Proc. of the 2013 IEEE International Conference on Computer Vision (ICCV’13), Sydney, Australia, December 2013, pp. 3471–3478

  4. [4]

    Approximate nearest neighbor search under neural similarity metric for large-scale recommendation,

    R. Chen et al., “Approximate nearest neighbor search under neural similarity metric for large-scale recommendation,” InProceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM’22), Atlanta, USA, October 2022, pp. 3013–3022

  5. [5]

    Query-aware locality-sensitive hashing for approximate nearest neighbor search,

    Q. Huang, J. Feng, Y . Zhang, Q. Fang, and W. Ng, “Query-aware locality-sensitive hashing for approximate nearest neighbor search,” Proceedings of the VLDB Endowment, vol. 9, no. 1, pp. 1–12, 2015

  6. [6]

    iDEC: indexable distance estimating codes for approximate nearest neighbor search,

    L. Gong, H. Wang, M. Ogihara, and J. Xu, “iDEC: indexable distance estimating codes for approximate nearest neighbor search,”Proceedings of the VLDB Endowment, vol. 13, no. 9, pp. 1483–1497, 2020

  7. [7]

    A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search,

    M. Wang, X. Xu, Q. Yue, and Y . Wang, “A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search,”Proceedings of the VLDB Endowment, vol. 14, no. 11, pp. 1964– 1978, 2021

  8. [8]

    Efficient and robust approxi- mate nearest neighbor search using hierarchical navigable small world graphs,

    Y . A. Malkov and D. A. Yashunin, “Efficient and robust approxi- mate nearest neighbor search using hierarchical navigable small world graphs,”IEEE Transactions on Pattern Analysis and Machine Intelli- gence, vol. 42, no. 4, pp. 824–836, 2018

  9. [9]

    SOAR: im- proved indexing for approximate nearest neighbor search,

    P. Sun, D. Simcha, D. Dopson, R. Guo, and S. Kumar, “SOAR: im- proved indexing for approximate nearest neighbor search,” InProc. 7th Conference on Neural Information Processing Systems (NeurIPS’23), New Orleans, USA, December 2023, pp. 3189–3204

  10. [10]

    Gemini Embedding: Generalizable Embeddings from Gemini

    J. Lee et al., “Gemini embedding: Generalizable embeddings from gemini,” 2025, arXiv:2503.07891. [Online]. Available: https://arxiv.org/abs/2503.07891

  11. [11]

    Document embeddings for long texts from transformers and autoencoders,

    L. Christou, A. Bompotas, and C. Makris, “Document embeddings for long texts from transformers and autoencoders,” [Online]. Available: http://https://www.researchsquare.com/article/rs-5459822/v1

  12. [12]

    Distributed representations of sentences and documents,

    Q. Le and T. Mikolov, “Distributed representations of sentences and documents,”Proceedings of Machine Learning Research, vol. 32, no. 2. pp. 1188–1196, 2014

  13. [13]

    Revisiting kd-tree for nearest neighbor search,

    P. Ram and K. Sinha, “Revisiting kd-tree for nearest neighbor search,” InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’19), Anchorage, USA, July 2019, pp. 1378–1388

  14. [14]

    Accelerating large-scale inference with anisotropic vector quantization,

    R. Guo et al., “Accelerating large-scale inference with anisotropic vector quantization,” InProc. of the 37th International Conference on Machine Learning (ICML’20), July 2020, pp. 3887–3896

  15. [15]

    Available: https://github.com/google/highway

    [Online]. Available: https://github.com/google/highway

  16. [16]

    ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms,

    M. Aum ¨uller, E. Bernhardsson, and A. Faithfull, “ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms,” Information Systems, vol. 87, no. 101374, 2020

  17. [17]

    Optimized product quantization for approximate nearest neighbor search,

    T. Ge, K. He, Q. Ke, and J. Sun, “Optimized product quantization for approximate nearest neighbor search,” InProceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13), Portland, USA, June 2013, pp. 2946–2953

  18. [18]

    Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,

    J. Gao and C. Long, “Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,” Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–27, 2024

  19. [19]

    Arm 4-bit pq: Simd-based acceleration for approximate nearest neighbor search on arm,

    Y . Matsui, Y . Imaizumi, N. Miyamoto, and N. Yoshifuji, “Arm 4-bit pq: Simd-based acceleration for approximate nearest neighbor search on arm,” InProc. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’22), Singapore, May 2022, pp. 2080–2084

  20. [20]

    Kunpeng 920 chipset

    HiSilicon. Kunpeng 920 chipset. [Online]. Available: https://www.hisilicon.com/en/products/kunpeng/huawei- kunpeng/huawei-kunpeng-920

  21. [21]

    Kunpeng computing

    Huawei Technologies Ltd. Kunpeng computing. [Online]. Available: https://www.hikunpeng.com/zh

  22. [22]

    Accelerated nearest neighbor search with quick ADC,

    F. Andr ´e, A.-M. Kermarrec, and N.-L. Scouarnec, “Accelerated nearest neighbor search with quick ADC,” InProc. of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR’17), New Yor, USA, June 2017, pp. 159–167

  23. [23]

    KBest: Efficient vector search on Kunpeng CPU,

    M. Kaihao et al., “KBest: Efficient vector search on Kunpeng CPU,” 2025, arXiv:2508.03016. [Online]. Available: https://arxiv.org/abs/2508.03016

  24. [24]

    [Online]

    LightGBM documentation. [Online]. Available: https://lightgbm.readthedocs.io/en/latest/index.html

  25. [25]

    LightGBM: A highly efficient gradient boosting decision tree,

    K. Guolin et al., “LightGBM: A highly efficient gradient boosting decision tree,” InProc. of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, USA, Decem- ber 2017, pp. 3149–3157

  26. [26]

    [Online]

    The Gist dataset. [Online]. Available: http://corpus-texmex.irisa.fr/

  27. [27]

    Efficient indexing of billion-scale datasets of deep descriptors,

    A. Babenko and V . Lempitsky, “Efficient indexing of billion-scale datasets of deep descriptors,” InProceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, USA, June-July 2016, pp. 2055–2063

  28. [28]

    Babenko and D

    A. Babenko and D. Baranchuk. Text-to-Image dataset for billion-scale similarity search. [Online]. Available: https://research.yandex.com/datasets/text-to-image-dataset-for-billion- scale-similarity-search

  29. [29]

    [Online]

    BigANN benchmark: NeurIPS’21 competition track. [Online]. Avail- able: https://big-ann-benchmarks.com/neurips21.html

  30. [30]

    [Online]

    The Mnist database. [Online]. Available: https://en.wikipedia.org/wiki/MNIST database

  31. [31]

    [Online]

    The Fashion mnist dataset. [Online]. Available: https://en.wikipedia.org/wiki/Fashion MNIST

  32. [32]

    [Online]

    The Sift dataset. [Online]. Available: http://corpus-texmex.irisa.fr/

  33. [33]

    Pennington, R

    J. Pennington, R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. [Online]. Available: https://nlp.stanford.edu/projects/glove/

  34. [34]

    [Online]

    The ScaNN method. [Online]. Available: https://github.com/google- research/google-research/tree/master/scann

  35. [35]

    [Online]

    The Faiss method. [Online]. Available: https://github.com/facebookresearch/faiss VIII. APPENDIX This appendix details several geometrically motivated data filtration strategies that were investigated during the develop- ment of KScaNN. While these methods demonstrated theo- retical potential for pruning the search space, their computa- tional overhead ult...