
arxiv: 2605.15957 · v1 · pith:FCAPCZUO · submitted 2026-05-15 · 💻 cs.DB

To GPU or Not to GPU: Vector Search in Relational Engines

Pith reviewed 2026-05-19 19:01 UTC · model grok-4.3

classification 💻 cs.DB
keywords vector search · GPU acceleration · relational databases · TPC-H benchmark · SQL queries · embeddings · index optimization · database engines

The pith

An alternative organization of vector indexes and embeddings lets GPUs accelerate both relational queries and vector search in database engines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether vector search should move to GPUs inside relational database engines, since GPUs dominate AI workloads but databases remain CPU-based. It extends the TPC-H benchmark with vector data from text and images, creates representative SQL-plus-vector queries, and builds a modular execution engine that can dispatch work to CPU or GPU. Experiments across memory locations, index types, GPUs, and interconnects show that relational operations gain more from the GPU than vector search does, and that moving full indexes and embeddings to the GPU is often slower. By reorganizing the vector index and embeddings to shrink their footprint, the design reverses this outcome so that both relational and vector-search components run faster on the GPU, especially over fast interconnects such as NVLink.
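The trade-off described above, where shipping a full index and its embeddings to the GPU can cost more than the GPU saves, comes down to simple arithmetic. The sketch below is a back-of-envelope model under stated assumptions, not the paper's cost model: transfers run at one sustained bandwidth, the copy finishes before the GPU search starts, and the 100 GB/s figure and all runtimes in the usage lines are hypothetical placeholders rather than measurements of PCIe or NVLink.

```python
GB = 1e9  # bytes per gigabyte (decimal)


def transfer_seconds(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Time to move num_bytes at a sustained bandwidth given in GB/s."""
    return num_bytes / (bandwidth_gb_s * GB)


def gpu_search_is_faster(index_bytes: float, bandwidth_gb_s: float,
                         cpu_search_s: float, gpu_search_s: float) -> bool:
    """Compare searching in place on the CPU against copying the index
    (plus embeddings) to the GPU and searching there.

    Assumes no overlap between copy and search, and ignores allocation,
    memory pinning, and kernel-launch overheads.
    """
    gpu_total = transfer_seconds(index_bytes, bandwidth_gb_s) + gpu_search_s
    return gpu_total < cpu_search_s


# Hypothetical numbers: a 10x faster GPU search still loses when a 20 GB
# index must cross a 100 GB/s link, but wins once the footprint is 4 GB.
print(gpu_search_is_faster(20e9, 100, cpu_search_s=0.20, gpu_search_s=0.02))  # False
print(gpu_search_is_faster(4e9, 100, cpu_search_s=0.20, gpu_search_s=0.02))   # True
```

In this toy model a faster link only shrinks the transfer term, whereas a smaller index shrinks it at every bandwidth, which is why footprint reduction and interconnect speed compound.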

Core claim

With an alternative organization of vector index and embeddings that reduces index size, both the relational and vector search components are faster on the GPU, particularly on fast interconnects, in contrast with the architecture used in existing engines.

What carries the argument

Alternative organization of vector index and embeddings that reduces the size of the index, allowing GPU execution of SQL+VS queries without the data-movement penalty of conventional designs.
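The page does not spell out the alternative layout. One plausible source of savings, hinted at by the "owning indexes" wording in Figure 4 but not confirmed here, is the difference between an index that owns a private copy of every embedding and one that stores only row identifiers and reads vectors from the table's embedding column. The sketch sizes both for a hypothetical IVF index; the class name, the 8-byte row IDs, float32 vectors, and all dimensions are assumptions for illustration.

```python
from dataclasses import dataclass

FLOAT32_BYTES = 4
ROW_ID_BYTES = 8  # assumption: 64-bit row identifiers


@dataclass
class IvfIndexShape:
    """Hypothetical IVF index: num_lists centroids, num_vectors list entries."""
    num_vectors: int
    dim: int
    num_lists: int

    def centroid_bytes(self) -> int:
        return self.num_lists * self.dim * FLOAT32_BYTES

    def owning_bytes(self) -> int:
        # Inverted lists keep a private float32 copy of each vector plus its row id.
        per_entry = self.dim * FLOAT32_BYTES + ROW_ID_BYTES
        return self.centroid_bytes() + self.num_vectors * per_entry

    def referencing_bytes(self) -> int:
        # Inverted lists keep only row ids; vectors stay in the table column.
        return self.centroid_bytes() + self.num_vectors * ROW_ID_BYTES


shape = IvfIndexShape(num_vectors=1_000_000, dim=1024, num_lists=1024)
print(shape.owning_bytes())       # 4108194304
print(shape.referencing_bytes())  # 12194304
```

The caveat matters: a referencing layout only pays off if the embedding column is already resident where the search runs, or is cheaper to reach than a second copy; otherwise the vectors still have to cross the interconnect.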

If this is right

  • Relational components of SQL+VS queries benefit more from GPU execution than the vector-search component itself.
  • Moving existing vector indexes and embeddings to the GPU is not the best option even with fast interconnects.
  • Reducing index size through reorganization makes GPU-based vector search competitive with CPU versions.
  • Both relational and vector-search parts become faster on GPU than on CPU when the smaller index is used, especially over fast interconnects.
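The first bullet can be made concrete with a simple attribution metric. The function below is one plausible reading of the share of total CPU-to-GPU wall-time savings attributable to relational operators plotted in Figure 5; the exact definition the paper uses is not given on this page, and the runtimes are hypothetical.

```python
def relational_share_of_savings(rel_cpu: float, rel_gpu: float,
                                vs_cpu: float, vs_gpu: float) -> float:
    """Fraction of the end-to-end CPU->GPU time saving that comes from the
    relational operators rather than the vector-search operator.

    Values above 1.0 mean the vector-search part got slower on the GPU and
    the relational speedup had to pay for it.
    """
    total_saving = (rel_cpu + vs_cpu) - (rel_gpu + vs_gpu)
    if total_saving <= 0:
        raise ValueError("GPU run is not faster end to end; share is undefined")
    return (rel_cpu - rel_gpu) / total_saving


# Hypothetical runtimes in seconds, not measurements from the paper.
share = relational_share_of_savings(rel_cpu=0.8, rel_gpu=0.1, vs_cpu=0.3, vs_gpu=0.2)
print(round(share, 3))  # 0.875
```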

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Database architects may need to treat vector indexes as first-class GPU-resident structures rather than CPU-first objects that are occasionally copied.
  • The same reorganization technique could be tested on other vector-search workloads outside TPC-H to check whether the size reduction generalizes.
  • Future engines might expose the choice of index layout as a tunable parameter so users can trade index size for GPU acceleration.

Load-bearing premise

The modular execution engine accurately models the overheads and integration costs that would appear in a production relational database engine when adding GPU vector search support.

What would settle it

A production implementation of the optimized index inside an actual database engine that still shows higher end-to-end latency on GPU than on CPU even with NVLink.

Figures

Figures reproduced from arXiv: 2605.15957 by Bowen Wu, Gustavo Alonso, Joel André, Marko Kabić, Vasilis Mageirakos, Yannis Chronis.

Figure 2. Vec-H query reference implementations in MaxVec.
Figure 3. Vector search operators in MaxVec. Exhaustive (left) …
Figure 4. Vec-H per-query runtime with owning indexes …
Figure 5. Share of total CPU-to-GPU wall-time savings attributable to relational operators.
Figure 6. Vec-H per-query runtime under hybrid execution (VS on CPU, Rel on GPU).
Figure 7. Vec-H per-query runtime on GH200-NVLink under the four optimized execution strategies …
Figure 8. Vector search operator runtime on the reviews …
Figure 9. Per-query Vec-H runtime on DGX-Spark and GH200 for the …
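Figures 2 and 3 refer to Vec-H queries and to an exhaustive vector search operator in MaxVec. The sketch below is a hypothetical, engine-agnostic illustration of the simplest SQL+VS shape: a relational predicate prunes rows, then an exact (exhaustive) top-k ranks the survivors by cosine similarity. The `Part` tuple and its fields are loosely modeled on the TPC-H part table; they are not the Vec-H schema or MaxVec's API.

```python
import heapq
import math
from typing import NamedTuple, Sequence, Tuple


class Part(NamedTuple):
    partkey: int
    retailprice: float
    embedding: Tuple[float, ...]  # hypothetical per-row embedding vector


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_exhaustive_topk(rows, query, k, max_price):
    """Roughly: SELECT partkey FROM part WHERE retailprice <= :max_price
    ORDER BY similarity(embedding, :query) DESC LIMIT :k"""
    survivors = (r for r in rows if r.retailprice <= max_price)  # relational filter
    # Exhaustive scan: score every survivor, keep the k best.
    best = heapq.nlargest(k, survivors, key=lambda r: cosine(r.embedding, query))
    return [r.partkey for r in best]


parts = [
    Part(1, 10.0, (1.0, 0.0)),
    Part(2, 50.0, (0.9, 0.1)),  # similar, but filtered out by price
    Part(3, 5.0, (0.0, 1.0)),
    Part(4, 8.0, (0.7, 0.7)),
]
print(filtered_exhaustive_topk(parts, query=(1.0, 0.0), k=2, max_price=20.0))  # [1, 4]
```

Whether the predicate runs before the vector operator (pre-filtering) or after it (post-filtering) is the kind of plan choice that shifts work between the relational and vector-search components.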
Original abstract

Vector search (VS) is now available in most database engines. However, while vector search is a common feature in AI/ML/LLMs where the dominant computing platforms are GPUs, existing database engines operate on CPUs even when implementing vector search. This raises the question of whether integrating vector processing on GPUs as part of the engine would be a better design. In this paper, we explore this question in detail. First, we extend the TPC-H benchmark with vector data (from text and images) and propose a number of representative SQL+VS queries. Second, we develop a modular execution engine that can run SQL+VS queries across CPU and GPU. Third, we perform extensive experiments on a number of deployments: running the SQL+VS queries across CPU and/or GPU, with data residing in CPU or GPU memory, with existing indices and novel, optimized versions, as well as across different GPUs and interconnects (PCIe, NVLink). The results provide actionable and counter-intuitive insights on how to run such queries over CPUs and GPUs. For instance, the relational components benefit much more from running on the GPU than the vector search part. In addition, when the vector search involves moving data and indexes, using the GPU is not the best option, even with fast interconnects. Thus, we develop an alternative organization of vector index and embeddings that reduces the size of the index, making GPU-based vector search more competitive. With these improvements, the final result is that both the relational and vector search components are faster on the GPU, particularly on fast interconnects, in contrast with the architecture used in existing engines.
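The deployments listed in the abstract boil down to choosing a device for each component of a query. The sketch below enumerates the four CPU/GPU placements of a two-stage plan (relational part, then vector search) and picks the cheapest. The per-component costs, with any index transfer folded into the GPU vector-search estimate, and the cross-device hand-off penalty are hypothetical inputs, not figures from the paper.

```python
from itertools import product


def best_placement(rel_cost: dict, vs_cost: dict, handoff_s: float):
    """Return (total_seconds, rel_device, vs_device) for the cheapest plan.

    rel_cost / vs_cost map "cpu"/"gpu" to estimated seconds for that component
    (a GPU estimate should already include moving any CPU-resident index).
    handoff_s is charged when the two components run on different devices.
    """
    plans = []
    for rel_dev, vs_dev in product(("cpu", "gpu"), repeat=2):
        total = rel_cost[rel_dev] + vs_cost[vs_dev]
        if rel_dev != vs_dev:
            total += handoff_s
        plans.append((total, rel_dev, vs_dev))
    return min(plans)


rel = {"cpu": 0.8, "gpu": 0.1}
# GPU vector search including the cost of shipping a large index (hypothetical).
print(best_placement(rel, {"cpu": 0.3, "gpu": 0.5}, handoff_s=0.05)[1:])  # ('gpu', 'cpu')
# Same query with a smaller index that is cheap to ship (hypothetical).
print(best_placement(rel, {"cpu": 0.3, "gpu": 0.1}, handoff_s=0.05)[1:])  # ('gpu', 'gpu')
```

With these made-up numbers the cheapest plan moves from a hybrid split to all-GPU once the GPU vector-search estimate drops, which mirrors qualitatively the progression the abstract describes.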

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates whether GPU-based vector search should be integrated into relational database engines, which currently rely on CPUs. It extends TPC-H with vector data from text and images, defines representative SQL+VS queries, builds a modular execution engine supporting CPU/GPU execution with varying data placements and interconnects (PCIe, NVLink), and evaluates existing and novel index organizations. The central claim is that an alternative vector index/embedding layout reducing index size makes both relational and vector-search components faster on GPU than CPU, especially on fast interconnects, in contrast to architectures in existing engines.

Significance. If the results hold, the work provides actionable guidance for hybrid SQL+vector workloads in AI/ML contexts by quantifying when GPU acceleration benefits relational components more than vector search itself and by demonstrating a size-reduced index organization that improves GPU competitiveness. Strengths include the broad experimental matrix across hardware, data locations, and index variants, plus direct measurements rather than model-derived claims.

major comments (2)
  1. [Modular execution engine description and experimental setup] The central performance claims rest on a custom modular execution engine whose fidelity to production relational engine costs is not demonstrated. Query optimizer extensions, cost-model integration, buffer-pool interactions, transaction/concurrency semantics, and data-movement consistency checks are omitted; if these costs are material, the reported GPU advantages with the reduced-size index may not hold in a real deployment such as PostgreSQL.
  2. [Results and index organization sections] The paper should quantify the index-size reduction achieved by the alternative organization and show its effect on data-movement volume and query plans; without these measurements it is difficult to isolate whether the reported speedups are due to the new layout or to other experimental factors.
minor comments (2)
  1. [Benchmark and query definitions] Clarify the exact set of SQL+VS queries used and whether they are representative of production vector workloads beyond TPC-H extensions.
  2. [Experimental results] Add statistical significance tests or confidence intervals for the reported performance differences across configurations.
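For the second minor comment, a minimal way to attach uncertainty to repeated runtime measurements is a confidence interval for the mean. The sketch uses the standard library's normal approximation; with only a handful of repetitions, a Student-t critical value gives wider and more honest intervals. The sample runtimes are invented for illustration.

```python
import statistics


def mean_ci(samples, confidence=0.95):
    """Normal-approximation confidence interval for the mean of samples."""
    if len(samples) < 2:
        raise ValueError("need at least two measurements")
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


def clearly_different(a, b, confidence=0.95):
    """Rough check: True if the two intervals do not overlap.
    Non-overlap is conservative; overlap does not prove equality."""
    lo_a, hi_a = mean_ci(a, confidence)
    lo_b, hi_b = mean_ci(b, confidence)
    return hi_a < lo_b or hi_b < lo_a


cpu_runs = [0.91, 0.95, 0.93, 0.94, 0.92]  # hypothetical seconds
gpu_runs = [0.41, 0.44, 0.40, 0.43, 0.42]
print(clearly_different(cpu_runs, gpu_runs))  # True
```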

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment below with explanations and indicate where revisions will be made to improve the manuscript.

Point-by-point responses
  1. Referee: The central performance claims rest on a custom modular execution engine whose fidelity to production relational engine costs is not demonstrated. Query optimizer extensions, cost-model integration, buffer-pool interactions, transaction/concurrency semantics, and data-movement consistency checks are omitted; if these costs are material, the reported GPU advantages with the reduced-size index may not hold in a real deployment such as PostgreSQL.

    Authors: We thank the referee for this observation. Our modular execution engine is a research prototype constructed specifically to isolate and directly measure the execution costs of relational operators and vector search on CPU versus GPU across controlled data placements and interconnects. This design choice enables precise attribution of performance differences to hardware and layout factors without the overheads of a full production stack. We acknowledge that a complete integration into a system such as PostgreSQL would introduce additional costs from query optimization, buffer-pool management, concurrency control, and consistency mechanisms that are outside the current scope. In the revised manuscript we will expand the experimental-setup section to explicitly discuss these limitations and their possible influence on generalizability, thereby clarifying the boundaries of our claims while retaining the value of the measured trade-offs. revision: partial

  2. Referee: The paper should quantify the index-size reduction achieved by the alternative organization and show its effect on data-movement volume and query plans; without these measurements it is difficult to isolate whether the reported speedups are due to the new layout or to other experimental factors.

    Authors: We agree that explicit quantification of the index-size reduction is needed to strengthen attribution of the observed speedups. The alternative organization reduces index size by co-locating compact embeddings with a pruned index structure, which directly lowers the volume of data transferred over the interconnect. In the revised version we will report concrete index sizes (in absolute terms and as percentage reduction) for both the baseline and proposed organizations, present measured data-movement volumes for representative queries, and describe how the smaller footprint alters execution plans within our modular engine. These additions will make it clearer that the performance gains stem from the reduced data movement enabled by the new layout. revision: yes
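The second response mentions compact embeddings without saying how they are encoded. Scalar quantization to 8-bit codes is one common way to shrink float32 vectors about fourfold; the sketch shows it purely as an illustrative assumption, not as the paper's method, and it trades some distance accuracy for the smaller footprint.

```python
from array import array


def quantize_int8(vec):
    """Symmetric per-vector scalar quantization to signed 8-bit codes.

    Returns (codes, scale) such that code * scale approximates each value.
    """
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = array("b", (round(x / scale) for x in vec))  # values in [-127, 127]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


vec = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(vec)
original_bytes = array("f", vec).itemsize * len(vec)  # float32 storage
compact_bytes = codes.itemsize * len(codes)           # int8 storage, plus one scale per vector
print(original_bytes, compact_bytes)  # 16 4
```

Each reconstructed value is within half a quantization step (`scale / 2`) of the original, which bounds the distance error this encoding introduces.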

Circularity Check

0 steps flagged

No circularity: results are direct experimental measurements

Full rationale

The paper conducts an empirical study: it extends TPC-H with vector data, builds a modular execution engine, and reports measured runtimes for SQL+VS queries across CPU/GPU, memory placements, indices, and interconnects. The central claim (alternative index/embedding organization improves GPU performance) follows from these measurements rather than any equation, fitted parameter, or self-citation that reduces the outcome to its own inputs by construction. No load-bearing derivation step collapses to a prior result or definition; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the custom modular engine faithfully represents production database behavior and that the chosen TPC-H vector extensions are representative of real workloads. No free parameters are fitted in the reported results; the work is purely empirical.

axioms (1)
  • domain assumption The modular execution engine accurately captures the integration and data-movement costs of a full relational database engine.
    Invoked when interpreting all CPU/GPU performance differences as representative of what a production system would experience.

pith-pipeline@v0.9.0 · 5839 in / 1290 out tokens · 34630 ms · 2026-05-19T19:01:07.638223+00:00 · methodology


