arxiv: 2604.21117 · v1 · submitted 2026-04-22 · 💻 cs.AR · cs.DB· cs.DC

Efficient Batch Search Algorithm for B+ Tree Index Structures with Level-Wise Traversal on FPGAs

Max Tzschoppe , Martin Wilhelm , Sven Groppe , Thilo Pionteck This is my paper

Pith reviewed 2026-05-09 22:28 UTC · model grok-4.3

classification 💻 cs.AR cs.DBcs.DC

keywords B+ treeFPGAbatch searchlevel-wise traversalindex structureshigh-level synthesisperformance accelerationdata center accelerator

0 comments

The pith

B+ tree searches on FPGAs achieve up to 4.9x speedup over single-threaded CPUs by processing search batches level by level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a batch search algorithm for B+ trees that traverses the tree level by level for a group of keys instead of one at a time. This design loads each tree node only once per level across all searches in the batch, cutting down on expensive memory accesses. The approach is implemented as a configurable kernel using high-level synthesis for FPGAs, allowing adjustments for batch size, node size, and tree depth. Evaluation on an AMD Alveo U250 shows clear performance gains against standard CPU implementations for large batches and million-entry trees. If the method holds, it suggests FPGAs can accelerate index operations in data-intensive applications where batch processing is feasible.

Core claim

The central claim is that a level-wise traversal of the B+ tree for batches of search keys, implemented on FPGA, reduces global memory accesses through node reuse and enables parallel comparisons, yielding a 4.9x speedup for a batch of 1000 keys on a 1-million entry tree compared to single-threaded CPU, and 2.1x over 16-threaded CPU when running four kernels in parallel.

What carries the argument

Level-wise batch traversal: processes all search keys at the current tree level before advancing to the next, maximizing reuse of loaded nodes and allowing on-FPGA parallel key comparisons.

Load-bearing premise

That processing searches in batches level by level will reduce memory accesses and improve reuse enough to offset data transfer costs and FPGA synthesis constraints for typical workloads.

What would settle it

Measure the FPGA search time for a batch size of 10 on a 1-million entry B+ tree and compare it directly to the CPU time; if the FPGA is slower, the batch size threshold for benefits would be shown.

Figures

Figures reproduced from arXiv: 2604.21117 by Martin Wilhelm, Max Tzschoppe, Sven Groppe, Thilo Pionteck.

**Figure 2.** Figure 2: provides an overview of the proposed implementation strategy. The core idea is to process queries in batches of search keys S by traversing the B+ tree level by level, while storing intermediate results in a FIFO buffer. Each entry in the result FIFO contains two fields: the address A of a node and the number # of search keys that need to be compared with that node. The general flow of our search algorithm… view at source ↗

**Figure 3.** Figure 3: Node Layout The node layout for inner and leaf nodes are illustrated in [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Key Comparison Procedure. the node to determine the correct child node address. Each individual comparison involves comparing a 32-byte search key with a 32-byte node key. This is implemented using 32 parallel 8-bit comparators, each producing a 3-bit result indicating whether the corresponding byte is less than, equal to, or greater than its counterpart. For each comparison, only one of these three bits c… view at source ↗

**Figure 6.** Figure 6: System Architecture As the focus of this work is on accelerating a static B+ tree, the index structure is maintained on the host system. For this purpose, we utilize the B+ tree implementation provided by the TLX Library [18]. The B+ trees are generated using the TLX library and we employ host-side code to transform the tree structure into flat arrays suitable for transfer to the FPGA’s global memory. Base… view at source ↗

**Figure 5.** Figure 5: Batch distribution for single and multiple kernel instances. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 7.** Figure 7: Interaction between Host and FPGA. For our measurements, we use a system with a static B+ tree index structure that is transferred to the FPGA once and reused thereafter. Therefore, the overhead of this one-time initialization phase becomes negligible in a productive system. The experimental results are obtained using the flow shown in Fig. 7a. For all results, we report the the kernel and compute unit exe… view at source ↗

**Figure 8.** Figure 8: Single Instance Designs - CU ( ) and Kernel ( ) execution times with a fix tree size of 1 million entries. Each configuration is 10× repeated. is a constant overhead of approximately 150 µs between the kernel and CU times. The CU time reflects the duration for transferring nodes between global memory and the FPGA, as well as the duration for executing all search operations for the given batch. As expected,… view at source ↗

**Figure 10.** Figure 10: Single-Threaded CPU - FPGA kernel ( ) and CPU search ( ) execution times with a fix tree size of 1 million entries. FPGA configuration is 10× and CPU is 100× repeated. 0 200 400 600 800 1,000 0 2,000 4,000 6,000 8,000 10,000 Batch Size Time in µs 4 Inst 1 Thread 2 Threads 4 Threads 16 Threads [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗

**Figure 11.** Figure 11: Multi-Threaded CPU - FPGA kernel ( ) and CPU search ( ) execution times with a fix tree size of 1 million entries and a tree order of m = 16. FPGA configuration is 10× and CPU is 100× repeated. It can be concluded that the FPGA and CPU implementations achieve optimal performance with a tree order of m = 16. Therefore, only the m = 16 tree order will be considered in the following. Multi-Threaded CPU. Dat… view at source ↗

read the original abstract

This paper introduces a search algorithm for index structures based on a B+ tree, specifically optimized for execution on a field-programmable gate array (FPGA). Our implementation efficiently traverses and reuses tree nodes by processing a batch of search keys level by level. This approach reduces costly global memory accesses, improves reuse of loaded B+ tree nodes, and enables parallel search key comparisons directly on the FPGA. Using a high-level synthesis (HLS) approach, we developed a highly flexible and configurable search kernel design supporting variable batch sizes, customizable node sizes, and arbitrary tree depths. The final design was implemented on an AMD Alveo U250 Data Center Accelerator Card, and was evaluated against the B+ tree search algorithm from the TLX library running on an AMD EPYC 7542 processor (2.9 GHz). With a batch size of 1000 search keys, a B+ tree containing one million entries, and a tree order of 16, we measured a 4.9x speedup for the single-kernel FPGA design compared to a single-threaded CPU implementation. Running four kernel instances in parallel on the FPGA resulted in a 2.1$\times$ performance improvement over a CPU implementation using 16 threads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a working HLS FPGA kernel for level-wise batched B+ tree search with concrete speedups on Alveo U250, but the numbers may not survive once PCIe transfers are counted.

read the letter

This paper gives a concrete FPGA implementation for searching B+ trees in batches by traversing level by level. They use high-level synthesis to make a configurable kernel that can handle different batch sizes and tree parameters, and they put it on an AMD Alveo U250 card. Against the TLX library on a 16-core EPYC, they show 4.9 times faster for one kernel versus one CPU thread, and 2.1 times faster for four kernels versus 16 threads, all for a million-entry tree of order 16 with 1000-key batches. The new part is the specific combination of level-wise batch processing on FPGA with direct hardware comparison. The design reuses nodes across the batch to cut down on global memory hits, which is a reasonable optimization for this workload. The configurability through HLS is useful for adapting to different tree shapes without rewriting the hardware description. It does a good job showing that you can get real speedups on off-the-shelf FPGA cards for this kind of index lookup. The numbers are from actual runs, not just simulation. The soft spot is the timing methodology. The abstract and description do not break out whether the FPGA times include moving the tree and the search batch over PCIe from the host. A 1M-entry B+ tree is tens of megabytes, so that transfer could eat into the gains. If the CPU runs everything in RAM and the FPGA numbers are only kernel time, the 4.9x and 2.1x figures are not apples-to-apples. They also skip error bars or run-to-run variation, and the memory access details are light. Overall the central idea holds up as a practical engineering effort, but the performance story needs more transparency on the full data path. This paper is for people building or evaluating hardware accelerators for database indexes. A reader who works on FPGA offloads for data center workloads will get something usable from the design and the numbers. It is worth sending to peer review because the implementation is real and the claims are testable, even if the evaluation section needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper presents a level-wise batch traversal algorithm for B+ tree searches optimized for FPGAs, implemented via HLS to support configurable batch sizes, node orders, and tree depths. It claims reduced global memory accesses through node reuse and parallel comparisons, with concrete speedups of 4.9× (single kernel vs. 1-thread CPU) and 2.1× (4 parallel kernels vs. 16-thread CPU) for a 1M-entry tree of order 16 and batch size 1000, measured on Alveo U250 against the TLX library on EPYC 7542.

Significance. If the speedups hold under end-to-end measurement, the work demonstrates a practical, configurable HLS-based FPGA accelerator for batch index searches, with explicit parameters (batch 1000, 1M entries, order 16) and direct wall-clock comparisons providing a reproducible baseline. The level-wise traversal approach addresses a real memory-access bottleneck in tree structures.

major comments (1)

[Evaluation] Evaluation section (and abstract): the reported wall-clock speedups of 4.9× and 2.1× do not state whether FPGA times include PCIe host-to-device transfers for the ~1M-node tree and batch data. The CPU baseline runs entirely in host memory, so excluding transfers would render the comparisons non-equivalent and undermine the central performance claims.

minor comments (2)

[Abstract] No error bars, run-to-run variance, or methodology details on memory layout and data movement are provided, weakening confidence in the measured speedups.
The description of how level-wise traversal reduces accesses and improves reuse could be expanded with a concrete example or diagram for a small tree.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and for identifying the need to clarify the scope of our wall-clock measurements. We address this point directly below and will revise the manuscript accordingly.

read point-by-point responses

Referee: Evaluation section (and abstract): the reported wall-clock speedups of 4.9× and 2.1× do not state whether FPGA times include PCIe host-to-device transfers for the ~1M-node tree and batch data. The CPU baseline runs entirely in host memory, so excluding transfers would render the comparisons non-equivalent and undermine the central performance claims.

Authors: We acknowledge that the manuscript does not explicitly state whether the reported FPGA wall-clock times include PCIe transfers. In our setup the measurements are taken from the host application as end-to-end wall-clock time, which includes host-to-device transfer of the tree structure and batch keys, kernel execution, and device-to-host transfer of results. This reflects the practical cost of using the FPGA accelerator. We will revise the Evaluation section and abstract to state this explicitly, thereby removing any ambiguity and confirming that the comparisons are end-to-end. If the referee prefers an additional breakdown that isolates compute time (excluding transfers), we can also add those figures in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical FPGA implementation evaluated against external CPU library

full rationale

The paper describes a hardware implementation of level-wise batch search on B+ trees for FPGAs, synthesized via HLS and measured on an Alveo U250 card. All performance numbers (4.9× single-kernel vs 1-thread CPU; 2.1× 4-kernel vs 16-thread CPU) are obtained from direct wall-clock timing of the synthesized kernel against the TLX library running on an AMD EPYC 7542. No equations, fitted parameters, or derivations appear in the provided text; the algorithm is presented as a concrete design choice rather than a mathematical prediction that reduces to its own inputs. The comparison uses an independent external baseline, satisfying the criterion for non-circular empirical work. Potential questions about PCIe transfer inclusion affect measurement validity but do not constitute circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on standard B+ tree node layout and FPGA memory hierarchy assumptions without introducing new fitted constants or postulated entities.

axioms (2)

domain assumption FPGA on-chip memory and global memory bandwidth suffice to hold and stream B+ tree nodes for the chosen batch sizes and tree orders
Invoked implicitly when claiming reduced global memory accesses and node reuse
domain assumption High-level synthesis produces a functionally correct and timing-closed kernel for the described level-wise traversal
Required for the configurable kernel to deliver the stated parallelism

pith-pipeline@v0.9.0 · 5537 in / 1548 out tokens · 38009 ms · 2026-05-09T22:28:55.482226+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references

[1]

A performance evaluation of in-memory databases,

A. T. Kabakus and R. Kara, “A performance evaluation of in-memory databases,”Journal of King Saud University - Computer and Information Sciences, vol. 29, no. 4, pp. 520–525, Oct. 2017

2017
[2]

Meet the walkers: accelerating index traversals for in- memory databases,

O. Kocberber, B. Grot, J. Picorel, B. Falsafi, K. Lim, and P. Ran- ganathan, “Meet the walkers: accelerating index traversals for in- memory databases,” inProceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46. ACM, Dec. 2013, pp. 468–479

2013
[3]

Jarke,Fundamentals of Data Warehouses, 1st ed., M

M. Jarke,Fundamentals of Data Warehouses, 1st ed., M. Lenzerini, Y . Vassiliou, and P. Vassiliadis, Eds. Berlin/Heidelberg: Springer Berlin Heidelberg, 2003, description based on publisher supplied metadata and other sources

2003
[4]

HPCache: memory- efficient OLAP through proportional caching revisited,

H. Nicholson, P. Chrysogelos, and A. Ailamaki, “HPCache: memory- efficient OLAP through proportional caching revisited,”The VLDB Journal, vol. 33, no. 6, pp. 1775–1791, Dec. 2023

2023
[5]

Opinion Pieces of the BTW 2025 Workshop On Advances in Cloud Data Management,

T. Bodner, A. B ¨ohm, M. B ¨other, A. Klimovic, D. Durner, M. Grund, A. Kipf, I. Oukid, B. Schiefer, P. Parchas, H. Gildhoff, P. Unterbrunner, T. Karnagel, J. Giceva, T. Ziegler, and M. Hentschel, “Opinion Pieces of the BTW 2025 Workshop On Advances in Cloud Data Management,” Datenbank-Spektrum, vol. 25, no. 3, pp. 187–195, Nov. 2025

2025
[6]

Schneider,Implementierungskonzepte fuer Datenbanksysteme, 1st ed

M. Schneider,Implementierungskonzepte fuer Datenbanksysteme, 1st ed. Springer Berlin Heidelberg, 2004

2004
[7]

Honeycomb: Ordered key-value store acceleration on an FPGA-based smartNIC,

J. Liu, A. Dragojevi ´c, S. Fleming, A. Katsarakis, D. Korolija, I. Zablotchi, H.-C. Ng, A. Kalia, and M. Castro, “Honeycomb: Ordered key-value store acceleration on an FPGA-based smartNIC,”IEEE Trans- actions on Computers, vol. 73, no. 3, pp. 857–871, Mar. 2024

2024
[8]

A novel FPGA-based high throughput accelerator for binary search trees,

O. Melikoglu, O. Ergin, B. Salami, J. Pavon, O. Unsal, and A. Cristal, “A novel FPGA-based high throughput accelerator for binary search trees,” in2019 International Conference on High Performance Computing & Simulation (HPCS). IEEE, Jul. 2019, pp. 612–619

2019
[9]

Hybrid FPGA approach for a b+ tree in a semantic web database system,

D. Heinrich, S. Werner, M. Stelzner, C. Blochwitz, T. Pionteck, and S. Groppe, “Hybrid FPGA approach for a b+ tree in a semantic web database system,” in2015 10th International Symposium on Recon- figurable Communication-centric Systems-on-Chip (ReCoSoC). IEEE, Jun. 2015, pp. 1–8

2015
[10]

In- memory database acceleration on FPGAs: a survey,

J. Fang, Y . T. B. Mulder, J. Hidders, J. Lee, and H. P. Hofstee, “In- memory database acceleration on FPGAs: a survey,”The VLDB Journal, vol. 29, no. 1, pp. 33–59, Oct. 2019

2019
[11]

A hybrid b+-tree as solution for in- memory indexing on CPU-GPU heterogeneous computing platforms,

A. Shahvarani and H.-A. Jacobsen, “A hybrid b+-tree as solution for in- memory indexing on CPU-GPU heterogeneous computing platforms,” inProceedings of the 2016 International Conference on Management of Data, ser. SIGMOD/PODS’16. ACM, Jun. 2016, pp. 1523–1538

2016
[12]

CuART - a CUDA-based, scalable radix-tree lookup and update engine,

M. Koppehel, T. Groth, S. Groppe, and T. Pionteck, “CuART - a CUDA-based, scalable radix-tree lookup and update engine,” in50th International Conference on Parallel Processing, ser. ICPP 2021. ACM, Aug. 2021, pp. 1–10

2021
[13]

(2025, Jul.) Mysql 8.4

Oracle. (2025, Jul.) Mysql 8.4. [Online]. Available: https://dev.mysql. com/doc/refman/8.0/en/innodb-index-types.html

2025
[14]

(2025) Postgresql 17.5

PostgreSQL Global Development Group. (2025) Postgresql 17.5. [Online]. Available: https://www.postgresql.org/docs/current/ indexes-types.html

2025
[15]

(2025, Jan.) Db2 database

IBM Corporation. (2025, Jan.) Db2 database. IBM Corporation. [Online]. Available: https://www.ibm.com/docs/en/db2/12.1.0?topic= indexes-index-structure

2025
[16]

(2025, Apr.) Oracle database, database concepts, 21c

Oracle. (2025, Apr.) Oracle database, database concepts, 21c. Oracle. [Online]. Available: https://docs.oracle.com/en/database/oracle/ oracle-database/21/cncpt/indexes-and-index-organized-tables.html

2025
[17]

Organization and maintenance of large ordered indexes,

R. Bayer and E. M. McCreight, “Organization and maintenance of large ordered indexes,”Acta Informatica, vol. 1, no. 3, pp. 173–189, 1972

1972
[18]

Bingmann

T. Bingmann. (2018) TLX: Collection of sophisticated C++ data structures, algorithms, and miscellaneous helpers. [Online]. Available: https://panthema.net/tlx

2018
[19]

Dodge,The concise encyclopedia of statistics

Y . Dodge,The concise encyclopedia of statistics. Springer Science + Business Medi, 2008

2008