Efficient Batch Search Algorithm for B+ Tree Index Structures with Level-Wise Traversal on FPGAs
Pith reviewed 2026-05-09 22:28 UTC · model grok-4.3
The pith
B+ tree searches on FPGAs achieve up to 4.9x speedup over single-threaded CPUs by processing search batches level by level.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a level-wise traversal of the B+ tree for batches of search keys, implemented on FPGA, reduces global memory accesses through node reuse and enables parallel comparisons, yielding a 4.9x speedup for a batch of 1000 keys on a 1-million entry tree compared to single-threaded CPU, and 2.1x over 16-threaded CPU when running four kernels in parallel.
What carries the argument
Level-wise batch traversal: processes all search keys at the current tree level before advancing to the next, maximizing reuse of loaded nodes and allowing on-FPGA parallel key comparisons.
Load-bearing premise
That processing searches in batches level by level will reduce memory accesses and improve reuse enough to offset data transfer costs and FPGA synthesis constraints for typical workloads.
What would settle it
Measure the FPGA search time for a batch size of 10 on a 1-million entry B+ tree and compare it directly to the CPU time; if the FPGA is slower, the batch size threshold for benefits would be shown.
Figures
read the original abstract
This paper introduces a search algorithm for index structures based on a B+ tree, specifically optimized for execution on a field-programmable gate array (FPGA). Our implementation efficiently traverses and reuses tree nodes by processing a batch of search keys level by level. This approach reduces costly global memory accesses, improves reuse of loaded B+ tree nodes, and enables parallel search key comparisons directly on the FPGA. Using a high-level synthesis (HLS) approach, we developed a highly flexible and configurable search kernel design supporting variable batch sizes, customizable node sizes, and arbitrary tree depths. The final design was implemented on an AMD Alveo U250 Data Center Accelerator Card, and was evaluated against the B+ tree search algorithm from the TLX library running on an AMD EPYC 7542 processor (2.9 GHz). With a batch size of 1000 search keys, a B+ tree containing one million entries, and a tree order of 16, we measured a 4.9x speedup for the single-kernel FPGA design compared to a single-threaded CPU implementation. Running four kernel instances in parallel on the FPGA resulted in a 2.1$\times$ performance improvement over a CPU implementation using 16 threads.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a level-wise batch traversal algorithm for B+ tree searches optimized for FPGAs, implemented via HLS to support configurable batch sizes, node orders, and tree depths. It claims reduced global memory accesses through node reuse and parallel comparisons, with concrete speedups of 4.9× (single kernel vs. 1-thread CPU) and 2.1× (4 parallel kernels vs. 16-thread CPU) for a 1M-entry tree of order 16 and batch size 1000, measured on Alveo U250 against the TLX library on EPYC 7542.
Significance. If the speedups hold under end-to-end measurement, the work demonstrates a practical, configurable HLS-based FPGA accelerator for batch index searches, with explicit parameters (batch 1000, 1M entries, order 16) and direct wall-clock comparisons providing a reproducible baseline. The level-wise traversal approach addresses a real memory-access bottleneck in tree structures.
major comments (1)
- [Evaluation] Evaluation section (and abstract): the reported wall-clock speedups of 4.9× and 2.1× do not state whether FPGA times include PCIe host-to-device transfers for the ~1M-node tree and batch data. The CPU baseline runs entirely in host memory, so excluding transfers would render the comparisons non-equivalent and undermine the central performance claims.
minor comments (2)
- [Abstract] No error bars, run-to-run variance, or methodology details on memory layout and data movement are provided, weakening confidence in the measured speedups.
- The description of how level-wise traversal reduces accesses and improves reuse could be expanded with a concrete example or diagram for a small tree.
Simulated Author's Rebuttal
We thank the referee for the detailed review and for identifying the need to clarify the scope of our wall-clock measurements. We address this point directly below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: Evaluation section (and abstract): the reported wall-clock speedups of 4.9× and 2.1× do not state whether FPGA times include PCIe host-to-device transfers for the ~1M-node tree and batch data. The CPU baseline runs entirely in host memory, so excluding transfers would render the comparisons non-equivalent and undermine the central performance claims.
Authors: We acknowledge that the manuscript does not explicitly state whether the reported FPGA wall-clock times include PCIe transfers. In our setup the measurements are taken from the host application as end-to-end wall-clock time, which includes host-to-device transfer of the tree structure and batch keys, kernel execution, and device-to-host transfer of results. This reflects the practical cost of using the FPGA accelerator. We will revise the Evaluation section and abstract to state this explicitly, thereby removing any ambiguity and confirming that the comparisons are end-to-end. If the referee prefers an additional breakdown that isolates compute time (excluding transfers), we can also add those figures in the revision. revision: yes
Circularity Check
No circularity: empirical FPGA implementation evaluated against external CPU library
full rationale
The paper describes a hardware implementation of level-wise batch search on B+ trees for FPGAs, synthesized via HLS and measured on an Alveo U250 card. All performance numbers (4.9× single-kernel vs 1-thread CPU; 2.1× 4-kernel vs 16-thread CPU) are obtained from direct wall-clock timing of the synthesized kernel against the TLX library running on an AMD EPYC 7542. No equations, fitted parameters, or derivations appear in the provided text; the algorithm is presented as a concrete design choice rather than a mathematical prediction that reduces to its own inputs. The comparison uses an independent external baseline, satisfying the criterion for non-circular empirical work. Potential questions about PCIe transfer inclusion affect measurement validity but do not constitute circularity in any derivation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption FPGA on-chip memory and global memory bandwidth suffice to hold and stream B+ tree nodes for the chosen batch sizes and tree orders
- domain assumption High-level synthesis produces a functionally correct and timing-closed kernel for the described level-wise traversal
Reference graph
Works this paper leans on
-
[1]
A performance evaluation of in-memory databases,
A. T. Kabakus and R. Kara, “A performance evaluation of in-memory databases,”Journal of King Saud University - Computer and Information Sciences, vol. 29, no. 4, pp. 520–525, Oct. 2017
2017
-
[2]
Meet the walkers: accelerating index traversals for in- memory databases,
O. Kocberber, B. Grot, J. Picorel, B. Falsafi, K. Lim, and P. Ran- ganathan, “Meet the walkers: accelerating index traversals for in- memory databases,” inProceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46. ACM, Dec. 2013, pp. 468–479
2013
-
[3]
Jarke,Fundamentals of Data Warehouses, 1st ed., M
M. Jarke,Fundamentals of Data Warehouses, 1st ed., M. Lenzerini, Y . Vassiliou, and P. Vassiliadis, Eds. Berlin/Heidelberg: Springer Berlin Heidelberg, 2003, description based on publisher supplied metadata and other sources
2003
-
[4]
HPCache: memory- efficient OLAP through proportional caching revisited,
H. Nicholson, P. Chrysogelos, and A. Ailamaki, “HPCache: memory- efficient OLAP through proportional caching revisited,”The VLDB Journal, vol. 33, no. 6, pp. 1775–1791, Dec. 2023
2023
-
[5]
Opinion Pieces of the BTW 2025 Workshop On Advances in Cloud Data Management,
T. Bodner, A. B ¨ohm, M. B ¨other, A. Klimovic, D. Durner, M. Grund, A. Kipf, I. Oukid, B. Schiefer, P. Parchas, H. Gildhoff, P. Unterbrunner, T. Karnagel, J. Giceva, T. Ziegler, and M. Hentschel, “Opinion Pieces of the BTW 2025 Workshop On Advances in Cloud Data Management,” Datenbank-Spektrum, vol. 25, no. 3, pp. 187–195, Nov. 2025
2025
-
[6]
Schneider,Implementierungskonzepte fuer Datenbanksysteme, 1st ed
M. Schneider,Implementierungskonzepte fuer Datenbanksysteme, 1st ed. Springer Berlin Heidelberg, 2004
2004
-
[7]
Honeycomb: Ordered key-value store acceleration on an FPGA-based smartNIC,
J. Liu, A. Dragojevi ´c, S. Fleming, A. Katsarakis, D. Korolija, I. Zablotchi, H.-C. Ng, A. Kalia, and M. Castro, “Honeycomb: Ordered key-value store acceleration on an FPGA-based smartNIC,”IEEE Trans- actions on Computers, vol. 73, no. 3, pp. 857–871, Mar. 2024
2024
-
[8]
A novel FPGA-based high throughput accelerator for binary search trees,
O. Melikoglu, O. Ergin, B. Salami, J. Pavon, O. Unsal, and A. Cristal, “A novel FPGA-based high throughput accelerator for binary search trees,” in2019 International Conference on High Performance Computing & Simulation (HPCS). IEEE, Jul. 2019, pp. 612–619
2019
-
[9]
Hybrid FPGA approach for a b+ tree in a semantic web database system,
D. Heinrich, S. Werner, M. Stelzner, C. Blochwitz, T. Pionteck, and S. Groppe, “Hybrid FPGA approach for a b+ tree in a semantic web database system,” in2015 10th International Symposium on Recon- figurable Communication-centric Systems-on-Chip (ReCoSoC). IEEE, Jun. 2015, pp. 1–8
2015
-
[10]
In- memory database acceleration on FPGAs: a survey,
J. Fang, Y . T. B. Mulder, J. Hidders, J. Lee, and H. P. Hofstee, “In- memory database acceleration on FPGAs: a survey,”The VLDB Journal, vol. 29, no. 1, pp. 33–59, Oct. 2019
2019
-
[11]
A hybrid b+-tree as solution for in- memory indexing on CPU-GPU heterogeneous computing platforms,
A. Shahvarani and H.-A. Jacobsen, “A hybrid b+-tree as solution for in- memory indexing on CPU-GPU heterogeneous computing platforms,” inProceedings of the 2016 International Conference on Management of Data, ser. SIGMOD/PODS’16. ACM, Jun. 2016, pp. 1523–1538
2016
-
[12]
CuART - a CUDA-based, scalable radix-tree lookup and update engine,
M. Koppehel, T. Groth, S. Groppe, and T. Pionteck, “CuART - a CUDA-based, scalable radix-tree lookup and update engine,” in50th International Conference on Parallel Processing, ser. ICPP 2021. ACM, Aug. 2021, pp. 1–10
2021
-
[13]
(2025, Jul.) Mysql 8.4
Oracle. (2025, Jul.) Mysql 8.4. [Online]. Available: https://dev.mysql. com/doc/refman/8.0/en/innodb-index-types.html
2025
-
[14]
(2025) Postgresql 17.5
PostgreSQL Global Development Group. (2025) Postgresql 17.5. [Online]. Available: https://www.postgresql.org/docs/current/ indexes-types.html
2025
-
[15]
(2025, Jan.) Db2 database
IBM Corporation. (2025, Jan.) Db2 database. IBM Corporation. [Online]. Available: https://www.ibm.com/docs/en/db2/12.1.0?topic= indexes-index-structure
2025
-
[16]
(2025, Apr.) Oracle database, database concepts, 21c
Oracle. (2025, Apr.) Oracle database, database concepts, 21c. Oracle. [Online]. Available: https://docs.oracle.com/en/database/oracle/ oracle-database/21/cncpt/indexes-and-index-organized-tables.html
2025
-
[17]
Organization and maintenance of large ordered indexes,
R. Bayer and E. M. McCreight, “Organization and maintenance of large ordered indexes,”Acta Informatica, vol. 1, no. 3, pp. 173–189, 1972
1972
-
[18]
Bingmann
T. Bingmann. (2018) TLX: Collection of sophisticated C++ data structures, algorithms, and miscellaneous helpers. [Online]. Available: https://panthema.net/tlx
2018
-
[19]
Dodge,The concise encyclopedia of statistics
Y . Dodge,The concise encyclopedia of statistics. Springer Science + Business Medi, 2008
2008
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.