pith. sign in

arxiv: 2604.16725 · v1 · submitted 2026-04-17 · 💻 cs.DB · cs.DC· cs.DS· cs.ET

FliX: Flipped-Indexing for Scalable GPU Queries and Updates

Pith reviewed 2026-05-10 06:33 UTC · model grok-4.3

classification 💻 cs.DB cs.DCcs.DScs.ET
keywords GPU concurrent data structuresflipped indexingdynamic updatesordered queriesB-treeLSM-treehash tableswarp assignment
0
0 comments X

The pith

FliX replaces index traversal with bucket-assigned warps and batch binary search to support dynamic ordered GPU data structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that traditional GPU concurrent data structures pay a heavy cost for maintaining an index layer to direct operations to data buckets, especially when updates are frequent. FliX instead assigns each warp to a bucket and lets the bucket locate its own operations inside the input batch via binary search. This change removes the need to keep or walk an index, cuts redundant traversals and divergence, and still preserves full ordering plus dynamic memory management. A reader cares because the approach delivers substantially lower query latency and higher update throughput than existing ordered GPU structures while remaining competitive with unordered hash tables.

Core claim

FliX is a comparison-based flipped indexing design in which compute units (warps) are bound to buckets in the data layer; each bucket then identifies the subset of a batched operation stream that it owns by performing a single binary search over the sorted batch. The design eliminates the separate index layer entirely, replaces repeated index traversals with per-bucket searches, simplifies update paths, and continues to support range queries, successor queries, and dynamic memory reclamation.

What carries the argument

Compute-to-bucket mapping together with per-bucket binary search over the operation batch, which substitutes for index-layer traversals.

If this is right

  • Query latency falls by a factor of 6.5 relative to a leading GPU B-tree and 1.5 relative to a leading GPU LSM-tree.
  • Throughput per unit of memory rises by a factor of four over other ordered GPU structures.
  • Insertion throughput in update-heavy workloads exceeds the nearest fully dynamic ordered baseline by more than eight times.
  • Query and deletion performance exceed that of state-of-the-art unordered GPU hash tables while still supporting ordering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bucket-centric assignment pattern could be applied to other many-core or vector architectures that suffer from index-induced divergence.
  • Bucket-level binary search may scale more gracefully than tree descent when the number of concurrent operations grows far beyond the number of buckets.
  • The memory-footprint advantage could make FliX attractive for GPU-resident indexes inside larger query engines that already keep data on the device.

Load-bearing premise

That mapping warps directly to buckets and locating operations via batch binary search will avoid introducing new contention, warp divergence, or correctness problems while still allowing dynamic memory reclamation and full ordering.

What would settle it

A workload in which bucket binary searches produce measurably higher warp divergence or lower throughput than index traversals under the same update intensity and bucket occupancy.

Figures

Figures reproduced from arXiv: 2604.16725 by Felix Schuhknecht, Justus Henneberg, Rosina Kharal, Trevor Brown.

Figure 1
Figure 1. Figure 1: Conceptual Figures: (a) index layer search required for each query key, (b) reduce index layer searches if bucket ranges [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: State-of-the-art GPU CDSs against FliX; alternating [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a) FliX initial build step: keys are divided into buckets. (b) A batch of keys is inserted. Each thread fetches a key from [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: In FliX, each bucket binary searches the query list [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Full heat map across varying insertion workloads with 4 rounds of insertions. The x-axis represents insertion [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of FliX insert kernels (ST or TL). (a) Uni [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Experiments with 4 consecutive rounds of insertions across 3 workloads of varying key densities. The [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: FliX (TL-Bulk) query performance against baselines following rounds of insertions and deletions. Figure (a) shows [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Average query performance across builds. [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Unsorted query performance, baselines vs FliX [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Successor results comparing the LSMu and FliX. [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
read the original abstract

GPU-based concurrent data structures (CDSs) achieve high throughput for read-only queries, but efficient support for dynamic updates on fully GPU-resident data remains challenging. Ordered CDSs (e.g., B-trees and LSM-trees) maintain an index layer that directs operations to a data layer (buckets or leaves), while hash tables avoid the cost of maintaining order but do not support range or successor queries. On GPUs, maintaining and traversing an index layer under frequent updates introduces contention and warp divergence. To tackle these problems, we flip the indexing paradigm on its head with FliX, a comparison-based, flipped indexing strategy for dynamic, fully GPU-resident CDSs. Traditional GPU CDSs typically take a batch of operations and assign each operation to a GPU thread or warp. FliX, however, assigns compute (e.g., a warp) to each bucket in the data layer, and each bucket then locates operations it is responsible for in the batch. FliX can replace many index layer traversals with a single binary search on the batch, reducing redundant work and warp divergence. Further, FliX simplifies updates as no index layer must be maintained. In our experiments, FliX achieves 6.5x reduced query latency compared to a leading GPU B-tree and 1.5x compared to a leading GPU LSM-tree, while delivering 4x higher throughput per memory footprint than ordered competitors. Despite maintaining order, FliX also surpasses state-of-the-art unordered GPU hash tables in query and deletion performance, and is highly competitive in insertion performance. In update-heavy workloads, it outperforms the closest fully dynamic ordered baseline by over 8x in insertion throughput while supporting dynamic memory reclamation. These results suggest that eliminating the index layer and adopting a compute-to-bucket mapping can enable practical, fully dynamic GPU indexing without sacrificing query performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FliX, a flipped-indexing strategy for fully GPU-resident concurrent data structures. Traditional designs assign operations to threads/warps and traverse an index layer to reach data buckets; FliX instead assigns warps to buckets in the data layer, which locate their responsible operations via a single binary search over a sorted batch. This eliminates index-layer maintenance and traversals, reducing contention and warp divergence while preserving full ordering, range/successor queries, dynamic updates, and memory reclamation. Experiments report 6.5× lower query latency than a leading GPU B-tree, 1.5× than a leading GPU LSM-tree, 4× higher throughput per memory footprint than ordered competitors, and >8× insertion throughput versus the closest dynamic ordered baseline in update-heavy workloads.

Significance. If the performance and correctness claims are substantiated, FliX offers a meaningful shift in GPU CDS design by showing that a compute-to-bucket mapping can deliver both high query performance and practical dynamic updates without an index layer. This addresses a long-standing tension between ordering and GPU efficiency and could influence future GPU database kernels and concurrent structures.

major comments (2)
  1. [Abstract] Abstract and experimental claims: concrete speedups (6.5× query latency vs. GPU B-tree, 1.5× vs. LSM-tree, 4× throughput-per-footprint, >8× insertion throughput) are stated without any description of hardware, baseline implementations, workload definitions, number of runs, error bars, or statistical tests. This directly undermines evaluation of the central performance claims.
  2. [§3] §3 (FliX design) and dynamic operations: the core assumption that warp-to-bucket assignment plus per-bucket binary search on the operation batch avoids new contention, divergence, or synchronization costs during bucket splits and memory reclamation is load-bearing. If bucket occupancies become imbalanced or boundary updates require per-operation atomics, the reported latency and throughput advantages would not hold; no microbenchmarks or divergence analysis of these paths are provided.
minor comments (2)
  1. [§4] Ensure every baseline (e.g., the specific 'leading GPU B-tree' and 'leading GPU LSM-tree') is named with citation and version in the experimental section so readers can reproduce the comparisons.
  2. [§3] Clarify how the sorted operation batch is produced and whether its construction cost is included in the reported latencies and throughputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, acknowledging where revisions are needed to strengthen the presentation of results and analysis.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental claims: concrete speedups (6.5× query latency vs. GPU B-tree, 1.5× vs. LSM-tree, 4× throughput-per-footprint, >8× insertion throughput) are stated without any description of hardware, baseline implementations, workload definitions, number of runs, error bars, or statistical tests. This directly undermines evaluation of the central performance claims.

    Authors: We agree that the abstract would be improved by providing readers with immediate context for the reported speedups. The full details—including NVIDIA A100 GPU hardware, baseline implementations drawn from the cited GPU B-tree and LSM-tree papers, YCSB and custom range/update workloads, 10 runs with mean and standard deviation, and absence of statistical hypothesis tests—are presented in Section 5. We will revise the abstract to add a concise qualifier (e.g., “on NVIDIA A100 GPUs using standard benchmarks”) and an explicit pointer to the evaluation section. This change preserves abstract brevity while directly addressing the concern. revision: partial

  2. Referee: [§3] §3 (FliX design) and dynamic operations: the core assumption that warp-to-bucket assignment plus per-bucket binary search on the operation batch avoids new contention, divergence, or synchronization costs during bucket splits and memory reclamation is load-bearing. If bucket occupancies become imbalanced or boundary updates require per-operation atomics, the reported latency and throughput advantages would not hold; no microbenchmarks or divergence analysis of these paths are provided.

    Authors: The referee correctly notes that the claimed advantages depend on dynamic paths not introducing hidden costs. Section 3 explains that splits trigger warp reassignment to newly created buckets followed by a single batch binary search, and reclamation uses deferred batch-level synchronization rather than per-operation atomics. We acknowledge the absence of dedicated microbenchmarks or divergence profiles for these paths. We will add a short analysis subsection (3.4) containing warp-divergence measurements (via Nsight Compute) and atomic-operation counts for split and reclamation under both balanced and deliberately imbalanced bucket occupancies. This will substantiate that the design remains efficient in these cases. revision: yes

Circularity Check

0 steps flagged

No circularity detected in algorithmic design or experimental evaluation

full rationale

The paper describes a new GPU data structure design (flipped indexing with warp-to-bucket assignment and per-bucket binary search on operation batches) and reports empirical performance results against baselines. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the provided text. The central claims rest on experimental measurements of latency and throughput rather than any derivation that reduces to its own premises by construction. This is a standard empirical algorithms paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the approach rests on standard GPU programming assumptions (warp execution, memory coalescing) and the existence of comparison-based ordering; no free parameters, ad-hoc axioms, or invented entities are introduced or quantified.

pith-pipeline@v0.9.0 · 5661 in / 1134 out tokens · 64665 ms · 2026-05-10T06:33:33.348483+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. [1]

    Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D

    Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta. 2009. Real-time parallel hashing on the GPU.ACM Trans. Graph.28, 5 (2009), 154. https: //doi.org/10.1145/1618452.1618500

  2. [2]

    Saman Ashkiani, Martin Farach-Colton, and John D. Owens. 2018. A Dynamic Hash Table for the GPU. In2018 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018, Vancouver, BC, Canada, May 21-25, 2018. IEEE Computer Society, Vancouver, BC, Canada, 419–429. https://doi.org/10.1109/ IPDPS.2018.00052

  3. [3]

    Saman Ashkiani, Shengren Li, Martin Farach-Colton, Nina Amenta, and John D. Owens. 2018. GPU LSM: A Dynamic Dictionary Data Structure for the GPU. In 12 FliX: Flipped-Indexing for Scalable GPU Queries and Updates 2018 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018, Vancouver, BC, Canada, May 21-25, 2018. IEEE Computer Society...

  4. [4]

    2022.Fully Concurrent GPU Data Structures

    Muhammad Abdelghaffar Awad. 2022.Fully Concurrent GPU Data Structures. Ph.D. Dissertation. University of California, Davis

  5. [5]

    In: PPoPP

    Muhammad A. Awad, Saman Ashkiani, Rob Johnson, Martín Farach-Colton, and John D. Owens. 2019. Engineering a high-performance GPU B-Tree. InProceed- ings of the 24th Symposium on Principles and Practice of Parallel Programming (Washington, District of Columbia)(PPoPP ’19). Association for Computing Ma- chinery, New York, NY, USA, 145–157. https://doi.org/1...

  6. [6]

    Jiashen Cao, Rathijit Sen, Matteo Interlandi, Joy Arulraj, and Hyesoon Kim. 2023. Gpu database systems characterization and optimization.Proceedings of the VLDB Endowment17, 3 (2023), 441–454. https://doi.org/10.14778/3632093.3632107

  7. [7]

    Mark Harris and Kyrylo Perelygin. 2017. Cooperative Groups: Flexible CUDA Thread Programming. (2017). https://developer.nvidia.com/blog/cooperative- groups/

  8. [8]

    Justus Henneberg and Felix Schuhknecht. 2023. RTIndeX: Exploiting Hardware- Accelerated GPU Raytracing for Database Indexing.Proc. VLDB Endow.16, 13 (2023), 4268–4281. https://www.vldb.org/pvldb/vol16/p4268-schuhknecht.pdf

  9. [9]

    Justus Henneberg, Felix Schuhknecht, Rosina Kharal, and Trevor Brown. 2025. More Bang for Your Buck(et): Fast and Space-Efficient Hardware-Accelerated Coarse-Granular Indexing on GPUs. In2025 IEEE 41st International Conference on Data Engineering (ICDE). IEEE, Hong Kong, China, 1320–1333. https://doi. org/10.1109/ICDE65448.2025.00103

  10. [10]

    Daniel Jünger, Robin Kobus, André Müller, Christian Hundt, Kai Xu, Weiguo Liu, and Bertil Schmidt. 2020. WarpCore: A Library for fast Hash Tables on GPUs. In 27th IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2020, Pune, India, December 16-19, 2020. IEEE, Pune, India, 11–20. https://doi.org/10.1109/HIPC50609.2020.00015

  11. [11]

    Brenton Lessley, Shaomeng Li, and Hank Childs. 2020. HashFight: A Platform- Portable Hash Table for Multi-Core and Many-Core Architectures.Electronic Imaging32, 1 (2020), 376–1–376–13. https://doi.org/10.2352/ISSN.2470-1173. 2020.1.VDA-376

  12. [12]

    Hao Li, Yi-Cheng Tu, and Bo Zeng. 2019. Concurrent query processing in a GPU-based database system.PloS one14, 4 (2019), e0214720

  13. [13]

    Chunbin Lin, Benjamin Mandel, Yannis Papakonstantinou, and Matthias Springer

  14. [14]

    VLDB Endow.10, 3 (Nov

    Fast In-Memory SQL Analytics on Typed Graphs.Proceedings of the VLDB Endowment10, 3 (2016), 265–276. https://doi.org/10.14778/3021924.3021941

  15. [15]

    Sparsh Mittal and Jeffrey S. Vetter. 2015. A Survey of CPU-GPU Heterogeneous Computing Management: GPU Architectures, Data Management and Scheduling Strategies.ACM Computing Surveys (CSUR)47, 4 (2015), 1–38. https://doi.org/ 10.1145/2788396

  16. [16]

    Nurit Moscovici, Nachshon Cohen, and Erez Petrank. 2017. A gpu-friendly skiplist algorithm. In2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, Portland, OR, USA, 246–259. https: //doi.org/10.1109/PACT.2017.13

  17. [17]

    Johns Paul, Shengliang Lu, and Bingsheng He. 2021. Database systems on GPUs. Foundations and Trends in Databases11, 1 (2021), 1–108. https://doi.org/10.1561/ 1900000076

  18. [18]

    Amirhesam Shahvarani and Hans-Arno Jacobsen. 2016. A hybrid B+-tree as solution for in-memory indexing on CPU-GPU heterogeneous computing plat- forms. InProceedings of the 2016 International Conference on Management of Data (SIGMOD ’16). Association for Computing Machinery, New York, NY, USA, 1523–1538

  19. [19]

    Anil Shanbhag, Samuel Madden, and Xiangyao Yu. 2020. A study of the funda- mental performance characteristics of GPUs and CPUs for database analytics. In Proceedings of the 2020 ACM SIGMOD international conference on Management of data. Association for Computing Machinery, New York, NY, USA, 1617–1632. https://doi.org/10.1145/3318464.3380595

  20. [20]

    Harshit Sharma and Anmol Sharma. 2024. A Comprehensive Overview of GPU Accelerated Databases.CoRRabs/2406.13831 (2024). https://doi.org/10.48550/ ARXIV.2406.13831 arXiv:2406.13831

  21. [21]

    Peng Wang, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, and Jason Cong. 2014. An Efficient Design and Implementation of LSM-tree Based Key-Value Store on Open-Channel SSD. InProceedings of the Ninth European Conference on Computer Systems. Association for Computing Machinery, New York, NY, USA, Article 16, 14 pages. https://doi.org/10.11...

  22. [22]

    2025.Fast Database Join on Ray-tracing Core Equipped GPU

    Yijie Wu. 2025.Fast Database Join on Ray-tracing Core Equipped GPU. Ph.D. Dissertation. University of Victoria

  23. [23]

    Kai Zhang, Kaibo Wang, Yuan Yuan, Lei Guo, Rubao Lee, and Xiaodong Zhang

  24. [24]

    VLDB Endow.8, 11 (2015), 1226–1237

    Mega-KV: A Case for GPUs to Maximize the Throughput of In-Memory Key-Value Stores.Proc. VLDB Endow.8, 11 (2015), 1226–1237. https://doi.org/10. 14778/2809974.2809984 13