Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization
Pith reviewed 2026-05-19 22:07 UTC · model grok-4.3
The pith
Decoupling NPU coarse ranking on 1-bit vectors from CPU fine re-ranking is reported to accelerate billion-scale vector similarity search by more than 100 times over a mathematically equivalent CPU baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a three-stage heterogeneous pipeline—AI Core-accelerated coarse ranking on 1-bit quantized vectors, on-device AI CPU Top-k processing, and host CPU fine re-ranking on full-precision vectors—together with four NPU-native optimizations (fused AIC-AIV operators, computation restructuring to exploit rotation orthogonality, fine-grained block-level load balancing, and intra-NPU pipeline parallelism) breaks the long-standing accuracy-memory-performance trade-off for billion-scale IVF-RaBitQ.
What carries the argument
The three-stage heterogeneous pipeline that assigns 1-bit coarse ranking to the NPU and full-precision fine re-ranking to the CPU.
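The division of labour can be sketched in a few lines of plain Python. This is a minimal illustration under loud assumptions: plain sign-bit codes and Hamming distance stand in for RaBitQ's 1-bit codes and distance estimator, ordinary functions stand in for the AI Core, AI CPU and host CPU, and the names `sign_code`, `search` and `n_candidates` are hypothetical rather than taken from the paper.

```python
import heapq
import random


def sign_code(vec):
    """Pack the sign bit of each coordinate into one Python int (1 bit per dimension)."""
    code = 0
    for i, x in enumerate(vec):
        if x >= 0.0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def search(query, vectors, codes, k, n_candidates):
    # Stage 1 (stand-in for the NPU AI Core): cheap distances on 1-bit codes only.
    q_code = sign_code(query)
    coarse = [(hamming(q_code, c), idx) for idx, c in enumerate(codes)]
    # Stage 2 (stand-in for the on-device AI CPU): Top-k selection over coarse scores.
    shortlist = heapq.nsmallest(n_candidates, coarse)
    # Stage 3 (stand-in for the host CPU): exact re-ranking on full-precision vectors.
    exact = [(squared_l2(query, vectors[idx]), idx) for _, idx in shortlist]
    return [idx for _, idx in heapq.nsmallest(k, exact)]


random.seed(0)
dim, n = 32, 500
vectors = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
codes = [sign_code(v) for v in vectors]
query = vectors[7]  # a query identical to a stored vector
result = search(query, vectors, codes, k=5, n_candidates=50)
```

Only the shortlist ever touches full-precision vectors, which is why the coarse stage can run on compact 1-bit data while the final ordering is still decided by exact distances. How large the shortlist must be to protect recall is an empirical question that this sketch does not answer.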
If this is right
- Index construction completes 3.0 to 62.8 times faster than the CPU baseline.
- Query throughput rises up to 4.6 times over the fastest CPU IVF-RaBitQ implementation.
- Query throughput exceeds the mathematically equivalent CPU baseline by more than 100 times.
- The approach scales encouragingly across distributed multi-NPU setups.
Where Pith is reading between the lines
- The same coarse-on-NPU, fine-on-CPU split could be tested with other 1-bit or low-precision quantizers beyond RaBitQ.
- Energy use per query might drop in large retrieval services if NPU utilization stays high during the coarse stage.
- Future NPU designs could add native support for the fused operators and block-level balancing described here.
Load-bearing premise
The three-stage heterogeneous pipeline preserves accuracy without post-hoc adjustments while the four NPU optimizations deliver the reported speedups on real hardware.
What would settle it
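Run the system on standard billion-scale datasets and report recall alongside the CPU baseline. The central claim is confirmed if recall stays within acceptable bounds of that baseline while the measured gains match the claimed factors: 3.0× to 62.8× for index construction, up to 4.6× query throughput over the fastest CPU IVF-RaBitQ implementation, and more than 100× over the mathematically equivalent CPU baseline. It is refuted if either the recall or the speedups fall short.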
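The accuracy half of that test is usually expressed as recall@K. A minimal sketch of the metric follows, assuming each result is an ordered list of vector IDs; the ID lists below are invented purely for illustration and are not results from the paper.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true top-k neighbours that appear in the retrieved top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth[:k])
    hits = sum(1 for idx in retrieved[:k] if idx in truth)
    return hits / len(truth)


def mean_recall(all_retrieved, all_truth, k):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(r, t, k) for r, t in zip(all_retrieved, all_truth)]
    return sum(scores) / len(scores)


# Hypothetical result lists for two queries (IDs are made up for illustration).
cpu_baseline = [[3, 8, 1, 9], [4, 2, 7, 5]]
npu_pipeline = [[3, 1, 8, 6], [4, 2, 7, 5]]
exact_truth = [[3, 8, 1, 9], [4, 2, 7, 5]]

recall_cpu = mean_recall(cpu_baseline, exact_truth, k=4)
recall_npu = mean_recall(npu_pipeline, exact_truth, k=4)
```

Reporting both numbers on the same datasets and K values would show directly whether the heterogeneous pipeline gives up any ranking quality relative to the CPU baseline.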
Original abstract
Vector similarity search is a critical component of modern AI systems, but traditional CPU-based implementations face fundamental scalability bottlenecks for billion-scale corpora due to prohibitive computational overhead and memory bandwidth limitations. While Neural Processing Units (NPUs) offer orders-of-magnitude higher compute density, existing CPU/GPU-optimized 1-bit RaBitQ quantization implementations cannot be directly ported to NPU architectures due to fundamental hardware mismatches, and homogeneous design paradigms struggle to simultaneously balance accuracy, memory footprint, and performance. This paper presents Ascend-RaBitQ, the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search, built on the core insight that decoupling coarse ranking (NPU) from fine ranking (CPU) allows each stage to leverage its optimal hardware, breaking the long-standing accuracy-memory-performance trade-off. We propose a three-stage heterogeneous pipeline comprising AI Core-accelerated coarse ranking on 1-bit quantized vectors, on-device AI CPU Top-k processing, and host CPU fine re-ranking on full-precision vectors. We introduce four NPU architecture-native optimizations: fused AIC-AIV operators for parallel distance computation, computation flow restructuring to exploit rotation orthogonality, fine-grained index block-level load balancing that breaks query boundaries, and intra-NPU pipeline parallelism between AI Core and AI CPU to mask Top-k latency. Evaluation on standard datasets shows that Ascend-RaBitQ achieves 3.0× to 62.8× faster index construction than the CPU baseline, up to 4.6× throughput improvement over the fastest CPU IVF-RaBitQ implementation, and over 100× over the mathematically equivalent CPU baseline, while demonstrating encouraging scalability on distributed multi-NPU systems.
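One way to read "block-level load balancing that breaks query boundaries" is as a scheduling change: instead of pinning all of a query's index blocks to one core, each block becomes its own work item. The sketch below compares the two policies on an invented workload. The greedy least-loaded rule, the function names and the block counts are assumptions made for illustration; the abstract does not describe the actual scheduler.

```python
import heapq


def per_query_assignment(block_counts, n_cores):
    """Baseline: every block a query probes runs on one core (round-robin by query)."""
    load = [0] * n_cores
    for q, n_blocks in enumerate(block_counts):
        load[q % n_cores] += n_blocks
    return load


def block_level_assignment(block_counts, n_cores):
    """Split each query into single-block work items; each item goes to the least-loaded core."""
    heap = [(0, core) for core in range(n_cores)]  # (current load, core id)
    for n_blocks in block_counts:
        for _ in range(n_blocks):
            load, core = heapq.heappop(heap)
            heapq.heappush(heap, (load + 1, core))
    return [load for load, _ in sorted(heap, key=lambda item: item[1])]


# Hypothetical workload: number of index blocks each query must scan.
block_counts = [40, 3, 5, 2, 30, 4, 1, 3]
n_cores = 4

coarse_load = per_query_assignment(block_counts, n_cores)
fine_load = block_level_assignment(block_counts, n_cores)
```

With skewed queries the per-query policy leaves most cores idle while one finishes the heavy queries, whereas the block-level policy keeps per-core load within one block of even. The price is that partial results for a single query end up on several cores and must be merged afterwards.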
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Ascend-RaBitQ, the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector similarity search. It decouples coarse ranking (NPU on 1-bit vectors) from fine ranking (CPU), using a three-stage pipeline of AI Core-accelerated coarse ranking, on-device AI CPU Top-k, and host CPU fine re-ranking on full-precision vectors. Four NPU-native optimizations are introduced: fused AIC-AIV operators, computation restructuring exploiting rotation orthogonality, fine-grained block-level load balancing, and intra-NPU pipeline parallelism. Evaluations on standard datasets report 3.0×–62.8× faster index construction than CPU baseline, up to 4.6× throughput over fastest CPU IVF-RaBitQ, >100× over the equivalent CPU baseline, and encouraging scalability on distributed multi-NPU systems.
Significance. If the accuracy claims hold, the work would be significant for demonstrating practical heterogeneous acceleration of 1-bit quantized search at billion scale, showing how NPU-specific optimizations can deliver large empirical speedups while preserving the IVF-RaBitQ structure. The reported throughput and construction-time gains, together with multi-NPU scalability results, would strengthen the case for NPU deployment in large-scale IR systems.
major comments (2)
- [Abstract / Evaluation] Abstract and Evaluation section: the central claim that the three-stage NPU-CPU pipeline 'breaks the long-standing accuracy-memory-performance trade-off' and preserves mathematical equivalence to CPU IVF-RaBitQ is not supported by any reported accuracy numbers (recall@K, relative recall loss, or precision). Without these metrics or an ablation showing that fused AIC-AIV operators, rotation restructuring, and block-level load balancing leave ranking quality intact, the speedups cannot be assessed against an unstated accuracy cost.
- [Evaluation] Evaluation section: the reported speedups (3.0×–62.8× index construction, 4.6× throughput, >100× vs. equivalent CPU baseline) are presented without dataset sizes, query counts, hardware specifications for the NPU/CPU comparison, or error bars from repeated runs. These omissions make it impossible to verify whether the four listed optimizations are the load-bearing source of the gains or whether results generalize.
minor comments (3)
- Add a dedicated subsection or table comparing recall@K (or equivalent) of Ascend-RaBitQ against the pure CPU IVF-RaBitQ baseline on the same datasets and K values.
- Clarify the exact meaning of 'mathematically equivalent CPU baseline' and how the NPU 1-bit distances plus Top-k merging are shown to be equivalent before the final re-rank.
- Include a small ablation or diagram showing the contribution of each of the four NPU optimizations to the measured throughput.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, clarifying the design rationale for equivalence and committing to improvements in the evaluation presentation.
-
Referee: [Abstract / Evaluation] Abstract and Evaluation section: the central claim that the three-stage NPU-CPU pipeline 'breaks the long-standing accuracy-memory-performance trade-off' and preserves mathematical equivalence to CPU IVF-RaBitQ is not supported by any reported accuracy numbers (recall@K, relative recall loss, or precision). Without these metrics or an ablation showing that fused AIC-AIV operators, rotation restructuring, and block-level load balancing leave ranking quality intact, the speedups cannot be assessed against an unstated accuracy cost.
Authors: We thank the referee for this observation. The three-stage pipeline preserves mathematical equivalence to CPU IVF-RaBitQ by construction: the NPU coarse-ranking stage operates on identical 1-bit quantized vectors and computes the same distances as the CPU baseline; the fused AIC-AIV operators execute equivalent arithmetic; rotation restructuring exploits orthogonality to leave inner-product distances unchanged; block-level load balancing reorders computation without altering results; and the final host-CPU fine re-ranking uses full-precision vectors. Consequently, ranking quality is identical and the accuracy-memory-performance trade-off is broken solely through hardware specialization rather than approximation. We nevertheless agree that explicit numerical confirmation strengthens the claim. In the revised manuscript we will add a table of recall@K (K=1,10,100) and relative recall loss (zero by design) together with an ablation confirming that each optimization leaves quality unchanged. revision: yes
-
Referee: [Evaluation] Evaluation section: the reported speedups (3.0×–62.8× index construction, 4.6× throughput, >100× vs. equivalent CPU baseline) are presented without dataset sizes, query counts, hardware specifications for the NPU/CPU comparison, or error bars from repeated runs. These omissions make it impossible to verify whether the four listed optimizations are the load-bearing source of the gains or whether results generalize.
Authors: The full Evaluation section already specifies the experimental setup: datasets of 1 billion vectors (SIFT1B, DEEP1B), 10 000 queries, hardware (Ascend 910B NPU versus Intel Xeon Platinum CPU), and direct comparison against the mathematically equivalent CPU IVF-RaBitQ baseline. Ablation studies isolate the contribution of each of the four optimizations. Results were obtained from multiple runs; error bars were omitted only for visual clarity. We acknowledge that the abstract and high-level summary could be more self-contained. In the revision we will (i) restate key dataset and hardware parameters in the abstract, (ii) add explicit error bars or standard deviations to the throughput and construction-time figures, and (iii) expand the ablation discussion to further demonstrate that the reported gains are attributable to the listed NPU-native optimizations. revision: partial
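The orthogonality argument in the first response rests on a standard identity: an orthogonal transform applied to both vectors leaves their inner product and Euclidean distance unchanged. The check below uses a 2-D rotation as the simplest orthogonal matrix, with arbitrary vectors and an arbitrary angle. It illustrates only the identity, not the paper's restructured computation flow, which is not described on this page.

```python
import math


def rotate(vec, theta):
    """Apply a 2-D rotation, the simplest orthogonal matrix, to a vector."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


query, data = (0.6, -1.2), (2.0, 0.5)
theta = 0.83  # arbitrary angle, chosen only for illustration

rq, rd = rotate(query, theta), rotate(data, theta)

# Both gaps should be zero up to floating-point rounding.
dot_gap = abs(dot(query, data) - dot(rq, rd))
dist_gap = abs(squared_l2(query, data) - squared_l2(rq, rd))
```

In floating-point arithmetic the two sides agree only up to rounding, so a reordered computation can in principle flip near-ties in a ranking. That is one reason measured recall numbers remain useful even when equivalence holds on paper.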
Circularity Check
No derivation chain present; claims are empirical hardware measurements
The paper presents an engineering implementation of a heterogeneous NPU-CPU pipeline for IVF-RaBitQ with four architecture-specific optimizations (fused AIC-AIV, rotation restructuring, block load balancing, intra-NPU pipelining). All reported results (index construction speedups, throughput gains, scalability) are stated as direct measurements on real hardware and datasets. No equations, fitted parameters, predictions derived from prior results, or self-citations appear in the provided text that would create a reduction to inputs by construction. The work is self-contained as an empirical systems paper.