KScaNN: Scalable Approximate Nearest Neighbor Search on Kunpeng
Pith reviewed 2026-05-18 01:29 UTC · model grok-4.3
The pith
KScaNN redesigns approximate nearest neighbor search for Kunpeng ARM servers and achieves up to 1.63 times the speed of the best x86 solutions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KScaNN is a new approximate nearest neighbor search method built from the start for the Kunpeng 920 ARM processor. It introduces a hybrid intra-cluster search strategy together with an improved residual calculation inside product quantization, an ML-driven adaptive search module that chooses parameters per query, and hand-tuned SIMD kernels that make full use of the ARM hardware for distance computations. These pieces together close the performance gap with x86 systems and deliver up to a 1.63x speedup over the fastest existing x86-based solutions.
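As a point of reference for the product-quantization terms used above, the following is a minimal sketch of the textbook residual encoding and table-based distance lookup. It does not reproduce KScaNN's improved residual calculation, which this summary does not specify. Every function name, codebook, and vector below is a hypothetical toy value chosen for illustration.

```python
# Textbook IVF-PQ residual encoding with asymmetric distance computation (ADC).
# Assumption: all names and sizes are illustrative; this is NOT KScaNN's method.

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, candidates):
    """Index of the candidate closest to vec."""
    return min(range(len(candidates)), key=lambda i: sq_dist(vec, candidates[i]))

def residual(vec, centroid):
    """Residual of a vector with respect to its coarse cluster centroid."""
    return [x - c for x, c in zip(vec, centroid)]

def split(vec, m):
    """Split a vector into m contiguous subvectors of equal length."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def pq_encode(res, codebooks):
    """Encode a residual as one codeword index per subspace."""
    subs = split(res, len(codebooks))
    return [nearest(sub, book) for sub, book in zip(subs, codebooks)]

def adc_table(query_res, codebooks):
    """Per-subspace squared distances from the query residual to every codeword."""
    subs = split(query_res, len(codebooks))
    return [[sq_dist(sub, word) for word in book] for sub, book in zip(subs, codebooks)]

def adc_distance(table, code):
    """Approximate distance: one table lookup per subspace, then a sum."""
    return sum(table[j][c] for j, c in enumerate(code))

# Toy data: 4-dimensional vectors, 2 subspaces, 2 codewords per subspace.
centroid = [1.0, 1.0, 1.0, 1.0]
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [-1.0, -1.0]],
]
database_vec = [2.0, 2.0, 0.0, 0.0]
query = [1.0, 1.0, 1.0, 1.0]

code = pq_encode(residual(database_vec, centroid), codebooks)
table = adc_table(residual(query, centroid), codebooks)
approx = adc_distance(table, code)
exact = sq_dist(query, database_vec)
```

In this toy case the database residual coincides with codewords, so the approximate and exact distances agree; on real data the table lookup trades a small quantization error for far fewer arithmetic operations per candidate.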
What carries the argument
A co-designed stack of hybrid intra-cluster search, improved PQ residual calculation, ML-driven per-query adaptive tuning, and Kunpeng-specific SIMD kernels that jointly optimize the algorithm and its low-level execution on ARM hardware.
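To make the role of SIMD kernels in distance computation concrete, the sketch below emulates in plain Python how a vector kernel is commonly structured for a squared-L2 distance: independent per-lane accumulators, a horizontal reduction, and a scalar tail. The lane count of four (four 32-bit floats per 128-bit register) and the function name are assumptions for illustration. The paper's hand-tuned kernels are not shown in this summary and would be written with hardware intrinsics or assembly, not Python.

```python
# Scalar emulation of a lane-wise SIMD accumulation pattern.
# Assumption: 4 lanes models four 32-bit floats in one 128-bit vector register.
LANES = 4

def l2_lanewise(a, b, lanes=LANES):
    """Squared L2 distance, organized the way a vector kernel would organize it."""
    acc = [0.0] * lanes                       # one running sum per lane
    full = len(a) - len(a) % lanes            # largest multiple of `lanes`
    for i in range(0, full, lanes):           # main loop: one "vector" per step
        for lane in range(lanes):
            d = a[i + lane] - b[i + lane]
            acc[lane] += d * d                # models a multiply-accumulate
    total = sum(acc)                          # horizontal reduction across lanes
    for i in range(full, len(a)):             # scalar tail for leftover elements
        d = a[i] - b[i]
        total += d * d
    return total

distance = l2_lanewise([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0.0] * 6)
```

Keeping separate accumulators per lane is the design point worth noting: it removes the serial dependency between consecutive additions, which is what lets real vector hardware keep its arithmetic units busy.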
If this is right
- Production recommendation and retrieval systems on Kunpeng servers can sustain higher query rates without extra hardware.
- The adaptive ML module removes the need for static parameter tuning that often wastes compute in live deployments.
- ARM vector units can be driven to high utilization in distance-heavy workloads when kernels are written for the specific microarchitecture.
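The per-query adaptive tuning mentioned in the list above can be pictured with a small sketch. The paper's module is ML-driven; the hand-written rule here is only a hypothetical stand-in for a trained model, and the feature, the parameter name `nprobe`, and its bounds are all assumptions.

```python
# Hypothetical stand-in for an ML-driven adaptive search module.
# A real system would query a trained model; a simple rule plays that role here.
import math

def centroid_distances(query, centroids):
    """Sorted Euclidean distances from the query to every coarse centroid."""
    return sorted(math.dist(query, c) for c in centroids)

def choose_nprobe(query, centroids, min_probe=1, max_probe=8):
    """Pick how many clusters to scan for this particular query.

    Feature: ratio of the nearest to the second-nearest centroid distance.
    Near 0, the query sits firmly inside one cluster and few probes suffice.
    Near 1, the query lies close to a cluster boundary and more probes help.
    """
    dists = centroid_distances(query, centroids)
    if len(dists) < 2 or dists[1] == 0.0:
        return min_probe
    ratio = dists[0] / dists[1]
    nprobe = min_probe + round(ratio * (max_probe - min_probe))
    # Never probe more clusters than exist.
    return max(min_probe, min(max_probe, nprobe, len(centroids)))

centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
easy = choose_nprobe([0.1, 0.0], centroids)   # deep inside the first cluster
hard = choose_nprobe([5.0, 5.0], centroids)   # equidistant from all centroids
```

A static configuration would have to use the larger setting for every query to protect recall on the hard ones; choosing per query is what avoids that wasted work on the easy ones.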
Where Pith is reading between the lines
- Other vector-similarity tasks beyond ANNS may see similar gains if the same adaptive and SIMD co-design pattern is applied on Kunpeng.
- Cloud operators running mixed ARM and x86 fleets could route vector workloads preferentially to Kunpeng once architecture-specific libraries are available.
- The results suggest that portable ANNS libraries will need explicit architecture back-ends rather than relying on generic ports for peak performance.
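The last point, explicit architecture back-ends instead of a generic port, might look like the following dispatch sketch. The registry, kernel names, and machine strings are hypothetical, and the "tuned" kernels are placeholders that reuse the portable math so the example stays runnable on any machine.

```python
# Hypothetical back-end registry that picks a distance kernel per architecture.
import platform

def l2_generic(a, b):
    """Portable squared-L2 kernel used when no tuned back-end matches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l2_arm(a, b):
    """Placeholder for an ARM-tuned kernel; same math as the generic one here."""
    return l2_generic(a, b)

def l2_x86(a, b):
    """Placeholder for an x86-tuned kernel; same math as the generic one here."""
    return l2_generic(a, b)

# Assumption: these are the machine strings the host reports, lowercased.
BACKENDS = {
    "aarch64": l2_arm,
    "arm64": l2_arm,
    "x86_64": l2_x86,
    "amd64": l2_x86,
}

def select_kernel(machine=None):
    """Return the kernel for the given (or detected) machine type."""
    name = (machine or platform.machine()).lower()
    return BACKENDS.get(name, l2_generic)

kernel = select_kernel()            # detected at runtime
result = kernel([0.0, 0.0], [3.0, 4.0])
```

The fallback to a portable kernel keeps the library usable everywhere, while the registry gives each architecture a place to plug in code written for its own vector units.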
Load-bearing premise
That a direct port of x86 ANNS algorithms to ARM creates a substantial performance deficit that only a hardware-aware redesign can overcome.
What would settle it
Side-by-side timing and recall measurements on the same Kunpeng 920 hardware where a carefully tuned x86 port matches or exceeds KScaNN query latency at equivalent accuracy.
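A measurement of this kind reduces to two numbers per system: recall at a fixed k and query throughput. The harness below is a minimal sketch under stated assumptions: the two search functions are hypothetical stand-ins (an exact scan and a deliberately lossy scan), not KScaNN or any real baseline, and the data is a toy one-dimensional set.

```python
# Minimal recall-and-throughput harness for comparing two search functions.
# Assumption: both "indexes" are toy stand-ins, not real ANNS libraries.
import time

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_search(data):
    """Brute-force search over every vector; also used to build ground truth."""
    def search(query, k):
        order = sorted(range(len(data)), key=lambda i: sq_dist(query, data[i]))
        return order[:k]
    return search

def even_only_search(data):
    """Deliberately lossy stand-in for an approximate index: scans even ids only."""
    def search(query, k):
        ids = range(0, len(data), 2)
        return sorted(ids, key=lambda i: sq_dist(query, data[i]))[:k]
    return search

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate result found."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def benchmark(search_fn, queries, ground_truth, k):
    """Return (mean recall@k, queries per second) for one search function."""
    start = time.perf_counter()
    results = [search_fn(q, k) for q in queries]
    elapsed = time.perf_counter() - start
    recall = sum(recall_at_k(r, g, k) for r, g in zip(results, ground_truth)) / len(queries)
    qps = len(queries) / elapsed if elapsed > 0 else float("inf")
    return recall, qps

data = [[0.0], [1.0], [2.0], [3.0], [4.0]]
queries = [[0.2], [3.6]]
k = 2
truth = [exact_search(data)(q, k) for q in queries]
exact_recall, exact_qps = benchmark(exact_search(data), queries, truth, k)
lossy_recall, lossy_qps = benchmark(even_only_search(data), queries, truth, k)
```

Throughput figures are only comparable when the recall values match, which is why the comparison described above is stated at equivalent accuracy.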
Original abstract
Approximate Nearest Neighbor Search (ANNS) is a cornerstone algorithm for information retrieval, recommendation systems, and machine learning applications. While x86-based architectures have historically dominated this domain, the increasing adoption of ARM-based servers in industry presents a critical need for ANNS solutions optimized on ARM architectures. A naive port of existing x86 ANNS algorithms to ARM platforms results in a substantial performance deficit, failing to leverage the unique capabilities of the underlying hardware. To address this challenge, we introduce KScaNN, a novel ANNS algorithm co-designed for the Kunpeng 920 ARM architecture. KScaNN embodies a holistic approach that synergizes sophisticated, data-aware algorithmic refinements with carefully designed hardware-specific optimizations. Its core contributions include: 1) novel algorithmic techniques, including a hybrid intra-cluster search strategy and an improved PQ residual calculation method, which optimize the search process at a higher level; 2) an ML-driven adaptive search module that provides adaptive, per-query tuning of search parameters, eliminating the inefficiencies of static configurations; and 3) highly optimized SIMD kernels for ARM that maximize hardware utilization for the critical distance computation workloads. The experimental results demonstrate that KScaNN not only closes the performance gap but establishes a new standard, achieving up to a 1.63x speedup over the fastest x86-based solution. This work provides a definitive blueprint for achieving leadership-class performance for vector search on modern ARM architectures and underscores
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces KScaNN, an approximate nearest neighbor search algorithm co-designed for the Kunpeng 920 ARM architecture. It combines a hybrid intra-cluster search strategy, an improved product quantization residual calculation, an ML-driven adaptive module for per-query parameter tuning, and ARM-specific SIMD kernels. The central experimental claim is that these techniques yield up to a 1.63x speedup over the fastest x86-based ANNS solutions.
Significance. If the performance claims are supported by fair, well-documented comparisons, the work is significant for information retrieval and vector search systems. It provides concrete hardware-aware optimizations for ARM servers, which are seeing increased industrial adoption, and demonstrates the value of combining algorithmic refinements with low-level SIMD work and adaptive ML control. The ML adaptive module in particular offers a reusable idea for reducing static configuration overhead.
major comments (1)
- [§5] §5 (Experimental Evaluation) and the associated tables: the 1.63x speedup claim over 'the fastest x86-based solution' is the central result, yet the manuscript supplies no explicit description of the x86 hardware (CPU model, core count, memory configuration) or the optimization effort applied to the ScaNN and Faiss baselines. Without this information it is impossible to determine whether the observed advantage arises from the proposed co-design or from differences in the underlying platforms and tuning parity.
minor comments (2)
- [Abstract] The abstract ends abruptly ('underscores').
- [§3] Notation for the adaptive module parameters could be introduced earlier and used consistently in the algorithmic description.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The feedback on the experimental section is particularly helpful for strengthening the clarity and reproducibility of our performance claims. We address the major comment below and will revise the manuscript accordingly.
- Referee: [§5] §5 (Experimental Evaluation) and the associated tables: the 1.63x speedup claim over 'the fastest x86-based solution' is the central result, yet the manuscript supplies no explicit description of the x86 hardware (CPU model, core count, memory configuration) or the optimization effort applied to the ScaNN and Faiss baselines. Without this information it is impossible to determine whether the observed advantage arises from the proposed co-design or from differences in the underlying platforms and tuning parity.
  Authors: We agree that the current manuscript does not provide sufficient detail on the x86 evaluation platform and baseline tuning. In the revised version we will add a dedicated paragraph (or short subsection) in §5 that explicitly states the x86 server configuration (CPU model, core count, memory capacity and type, and operating system) together with the precise optimization steps applied to the ScaNN and Faiss baselines, including compiler flags, recommended parameter settings from their official repositories, and any architecture-specific adjustments we performed to ensure a fair comparison. This addition will allow readers to assess whether the reported 1.63× advantage stems from our co-design or from platform differences.
  revision: yes
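One lightweight way to capture the platform details discussed in this exchange is to store a machine-readable record next to each benchmark result. The sketch below is an assumption about how that could be done, not something the authors describe. The field names, the baseline label, and the example flag are hypothetical, and the Python standard library does not portably expose the exact CPU model or memory type, so those would still need to be recorded by hand.

```python
# Hypothetical platform record to accompany a benchmark run.
# Assumption: field names and example values are illustrative only.
import json
import os
import platform

def platform_record(baseline, build_flags):
    """Collect the host details that the standard library can report."""
    return {
        "baseline": baseline,                  # label for the system under test
        "build_flags": list(build_flags),      # flags used to build that system
        "machine": platform.machine(),         # e.g. an ARM or x86 machine string
        "processor": platform.processor(),     # may be empty on some systems
        "logical_cpus": os.cpu_count(),        # may be None if undetermined
        "os": platform.platform(),
        "python": platform.python_version(),
    }

record = platform_record("example-baseline", ["-O3"])
serialized = json.dumps(record, indent=2)
print(serialized)
```

Writing this record automatically for every run, on both the ARM and the x86 host, makes it harder for a configuration difference to go unreported.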
Circularity Check
No significant circularity; claims rest on empirical benchmarks
The paper presents a co-designed ANNS implementation with algorithmic refinements (hybrid intra-cluster search, improved PQ residuals, ML adaptive module) and ARM-specific SIMD kernels. Performance claims are established via direct experimental comparison to x86 baselines rather than any mathematical derivation, fitted-parameter prediction, or self-referential equation. No load-bearing step reduces to its own inputs by construction, and no uniqueness theorem or ansatz is imported via self-citation. The work is self-contained against external benchmarks.
VIII. APPENDIX
This appendix details several geometrically motivated data filtration strategies that were investigated during the development of KScaNN. While these methods demonstrated theoretical potential for pruning the search space, their computational overhead ult...