A Pragmatic Approach to Learned Indexing in RocksDB: Targeted Optimizations with Minimal System Modification

Bettina Kemme; Oana Balmau; Olivier Michaud; Shubham Vashisth

arxiv: 2605.23815 · v1 · pith:7UG2B7QNnew · submitted 2026-05-22 · 💻 cs.DB · cs.DC


Pith reviewed 2026-05-25 02:17 UTC · model grok-4.3

classification 💻 cs.DB cs.DC

keywords learned indexes · RocksDB · LSM-tree · Memtable · database indexing · index structures · production systems · throughput optimization


The pith

Off-the-shelf learned indexes integrate into RocksDB via Memtable reuse and block-aware disk placement for up to 2.1X read throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether learned indexes that approximate key distributions can be added to an existing production database like RocksDB without a full redesign of its storage engine. It leverages the system's separation of in-memory Memtables from immutable on-disk files to apply different learned indexes at each layer, introducing a reuse mechanism so models retain knowledge when Memtables are replaced during writes. A read-only learned index is adapted to be block-aware for reliable single-I/O disk lookups, all without changing the storage layer or read path. Experiments across large-scale workloads with varied data distributions show concrete gains of 1.5X in write throughput and 2.1X in read throughput over current systems.
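To make "learned indexes that approximate key distributions" concrete, here is a minimal sketch of the general idea, not the indexes used in the paper. It assumes unique integer keys and fits a single straight line from key to position. The class name `LinearLearnedIndex` and its methods are hypothetical names chosen for this illustration.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: one linear model over a sorted key array plus an error bound.

    Illustrative only; assumes a non-empty collection of unique integer keys.
    """

    def __init__(self, keys):
        self.keys = sorted(set(keys))
        n = len(self.keys)
        first, last = self.keys[0], self.keys[-1]
        # Model: position ~ slope * (key - base), a line through the two endpoints.
        self.base = first
        self.slope = (n - 1) / (last - first) if last > first else 0.0
        # Largest gap between predicted and true position over the indexed keys.
        self.eps = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * (key - self.base)))

    def lookup(self, key):
        """Return the position of key, or None if it is not indexed."""
        n = len(self.keys)
        pred = self._predict(key)
        # Only the window [pred - eps, pred + eps] can hold an indexed key.
        lo = max(0, min(n, pred - self.eps))
        hi = max(0, min(n, pred + self.eps + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None
```

The model replaces a tree traversal with one arithmetic prediction, and the recorded error bound keeps the final search window small. How small depends on how well a line fits the key distribution, which is why skewed datasets are harder to index.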

Core claim

By deploying off-the-shelf learned indexes separately in Memtables with a reuse mechanism that preserves structural knowledge across instances and replacing the disk index with a block-aware learned index that supports worst-case single-I/O lookups, MountDB achieves up to 1.5X higher write throughput and 2.1X higher read throughput than state-of-the-art systems while requiring no modifications to the storage layer or read path.

What carries the argument

The reuse mechanism that preserves structural knowledge across Memtable instances, combined with the block-aware adaptation of read-only learned indexes for worst-case single-I/O lookups.

If this is right

Up to 1.5X higher write throughput than state-of-the-art systems on large-scale diverse workloads.
Up to 2.1X higher read throughput than state-of-the-art systems on large-scale diverse workloads.
Learned indexes can be integrated into production systems with minimal overhead and no changes to the storage layer or read path.
Established learned indexes can support concurrency and persistence when placed according to the Memtable and disk separation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same targeted placement and reuse pattern could be applied to other LSM-based key-value stores that maintain a similar Memtable-to-disk boundary.
Block-aware learned indexes might improve I/O predictability in systems with varying block sizes or different storage media.
Sustained memory footprint over long-running workloads could be measured to check whether model reuse keeps overall resource use low.

Load-bearing premise

The separation between in-memory Memtables and immutable on-disk files plus the reuse mechanism is sufficient to let off-the-shelf learned indexes support concurrency and persistence without correctness or performance regressions under write-heavy workloads.

What would settle it

A write-heavy workload with frequent Memtable replacements that produces either data inconsistencies, model adaptation failures, or throughput below the B+-tree baseline.

Figures

Figures reproduced from arXiv: 2605.23815 by Bettina Kemme, Oana Balmau, Olivier Michaud, Shubham Vashisth.

Figure 2: Cold-start behavior of updatable learned indexes.

Figure 3: MountDB’s operational flow. The read and write paths are virtually identical to those of RocksDB. MountDB pushes ...

Figure 4: (a) Fence-Key Modeling & (b) Lookup in PGM (Fence)

Figure 5: Cumulative distribution functions (CDFs) of real-world datasets, ordered by increasing indexing difficulty (left to right)

Figure 6: Large-Scale Workloads. (a) Average throughput and (b) 90th and 99th percentile latencies across five datasets, each ...

Figure 7: Throughput comparison across the six YCSB workloads (A–F) and three datasets (YCSB, Genome, OSM) under (a) ...

Figure 8: Impact of variable and mixed value size. (a) Write ...

Figure 9: Microbench - Varying Key Distribution. Write throughput of 1) MountDB ( ...

Figure 10: Read Performance Analysis. Workload with (a) All data cached (fits in memory), (b) 10% Data cached, (c) I/O-bound ...

Figure 11: Overhead of Learned Index Techniques on Back ...

Original abstract

Learned indexes have emerged as a promising alternative to traditional index structures, offering higher throughput and lower memory usage by approximating the cumulative key distribution function with lightweight models. Despite these benefits, adoption in production systems remains limited, partly because learned indexes that support concurrency and persistence as effectively as, e.g., the B+-Tree, do not yet exist, while many research prototypes introduce substantial complexity. In this paper, we investigate whether off-the-shelf learned indexes can be integrated into a production database with minimal storage-engine redesign. Using RocksDB as a case study, we exploit its separation between in-memory Memtables and immutable on-disk files to deploy specialized indexes at each level. We show that directly applying existing learned indexes is insufficient under write-heavy workloads because frequent Memtable replacement prevents models from fully adapting. To address this, we introduce a reuse mechanism that preserves structural knowledge across Memtable instances. At the storage level, we replace RocksDB's disk index with a learned index without modifying the storage layer or read path. We further adapt a read-only learned index to be block-aware, enabling worst-case single-I/O lookups. We implement these techniques in MountDB, an extension of RocksDB. Experiments on large-scale workloads with diverse data distributions and access patterns show up to 1.5X higher write throughput and 2.1X higher read throughput than state-of-the-art systems, demonstrating that established learned indexes can be integrated into production systems with minimal overhead and substantial performance benefits.
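The cold-start problem and the reuse idea described in the abstract can be illustrated with a toy model. This is a sketch under stated assumptions, not MountDB's mechanism: the "learned structure" is reduced to a list of split keys that partition the key space, and `BucketedMemtable`, `export_splits`, and `replace_memtable` are hypothetical names invented for this example.

```python
import bisect


class BucketedMemtable:
    """Toy Memtable whose 'model' is a sorted list of split keys (hypothetical design)."""

    def __init__(self, splits=()):
        self.splits = list(splits)  # learned structure: boundaries between buckets
        self.buckets = [[] for _ in range(len(self.splits) + 1)]

    def put(self, key):
        b = bisect.bisect_right(self.splits, key)  # model step: pick the bucket
        bisect.insort(self.buckets[b], key)        # sorted insert inside the bucket

    def contains(self, key):
        bucket = self.buckets[bisect.bisect_right(self.splits, key)]
        i = bisect.bisect_left(bucket, key)
        return i < len(bucket) and bucket[i] == key

    def sorted_keys(self):
        # Buckets cover disjoint, increasing key ranges, so concatenation is sorted.
        return [k for bucket in self.buckets for k in bucket]

    def export_splits(self, num_buckets):
        """Derive evenly spaced quantile keys from the current contents."""
        keys = self.sorted_keys()
        if len(keys) < num_buckets:
            return []
        step = len(keys) / num_buckets
        return [keys[int(i * step)] for i in range(1, num_buckets)]

    def largest_bucket(self):
        return max(len(b) for b in self.buckets)


def replace_memtable(old, num_buckets=4, reuse=True):
    """Create the next Memtable, optionally seeded with the old one's structure."""
    return BucketedMemtable(old.export_splits(num_buckets) if reuse else ())
```

With `reuse=False` the new Memtable starts with a single bucket and must absorb every insert into one long sorted run. With `reuse=True` it starts with boundaries that already match the previous key distribution, so inserts land in short buckets from the first write. The sketch ignores concurrency and assumes successive Memtables see similar key distributions.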

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper shows a low-friction way to add learned indexes to RocksDB via Memtable model reuse and a block-aware disk index, but the performance claims rest on thin experimental reporting.


The key point is that this work shows learned indexes can be integrated into RocksDB with two targeted changes (model reuse across Memtables and a block-aware disk index) while claiming notable throughput improvements and almost no other engine modifications. The new parts are the reuse mechanism, which deals with frequent Memtable swaps under writes by letting models keep their learned structure, and the block-aware version that enables single-I/O lookups on disk. These are sensible extensions of prior learned index techniques applied specifically to RocksDB's architecture.

The paper handles the minimal-modification angle well by exploiting the in-memory versus on-disk separation. This avoids the need for full concurrency and persistence support in the learned indexes themselves. The authors are right that many research systems add too much complexity, so focusing on off-the-shelf indexes with small tweaks is a reasonable direction.

The main concern is the experiments. Throughput numbers are given but without workload descriptions, baseline details, or error bars, so it is difficult to judge how solid the 1.5X and 2.1X claims are. The no-modification promise for the read path and storage layer depends on the learned index matching the original interfaces exactly; any difference in semantics would force changes, as the stress-test points out. The reuse mechanism also needs to maintain correctness under concurrency, which is not detailed here.

This is aimed at systems researchers looking at practical deployment of learned indexes. It would be useful for anyone thinking about production storage engines. I think it merits peer review to get the full implementation and results examined.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes integrating off-the-shelf learned indexes into RocksDB by exploiting the separation between in-memory Memtables and immutable on-disk files. It introduces a reuse mechanism to preserve model knowledge across frequent Memtable replacements under write-heavy workloads and adapts a read-only learned index to be block-aware for worst-case single-I/O lookups. These changes are implemented in MountDB without modifying the storage layer or read path, yielding up to 1.5× higher write throughput and 2.1× higher read throughput than state-of-the-art systems on large-scale workloads with diverse distributions and access patterns.

Significance. If the results hold, the work is significant for demonstrating a pragmatic path to adopt established learned indexes in production systems with targeted, minimal modifications rather than full redesigns. The reuse mechanism and block-aware adaptation address specific barriers (concurrency, persistence, adaptation lag) while crediting the use of off-the-shelf indexes and the Memtable/on-disk separation as enabling strengths. This could lower the barrier to deployment compared to research prototypes that introduce substantial complexity.

major comments (2)

[Abstract, paragraph on Memtable replacement] The reuse mechanism is asserted to solve adaptation lag under write-heavy workloads, yet no detail is given on model-state transfer across replacements while preserving thread-safety and persistence guarantees under concurrent writes. This assumption is load-bearing for the central claim that the Memtable/on-disk separation suffices to support concurrency and persistence without correctness or performance regressions.
[Abstract, storage-level paragraph] The claim that the disk index is replaced 'without modifying the storage layer or read path' and that a read-only learned index is adapted to be block-aware for single-I/O lookups requires that the learned index expose exactly the same lookup and iterator interfaces (including error bounds, block metadata handling, and semantics) as the original block-based index. Any deviation would force read-path changes, undermining the 'minimal modification' premise and the attribution of the reported throughput gains.

minor comments (1)

The abstract reports throughput numbers without workload details, baseline descriptions, error bars, or data-exclusion rules; the full manuscript should ensure these are explicitly documented in the experimental section to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments highlighting areas where the abstract could better support its claims. We address each point below with clarifications drawn from the full manuscript and indicate where revisions will be made.


Referee: [Abstract, paragraph on Memtable replacement] The reuse mechanism is asserted to solve adaptation lag under write-heavy workloads, yet no detail is given on model-state transfer across replacements while preserving thread-safety and persistence guarantees under concurrent writes. This assumption is load-bearing for the central claim that the Memtable/on-disk separation suffices to support concurrency and persistence without correctness or performance regressions.

Authors: The full manuscript (Section 3.2) details the reuse mechanism: model parameters are transferred via a compact serialization of the CDF approximation during Memtable replacement, and the new Memtable is initialized from this state to reduce adaptation lag. Thread-safety is achieved through atomic model pointer swaps and snapshot-based reads that keep concurrent modifications invisible to in-flight readers. Persistence is preserved because Memtable models are transient (SSTables on disk use the separate block-aware index) and recovery rebuilds from the write-ahead log (WAL) without relying on in-memory models. We agree the abstract should briefly reference this transfer process to make the concurrency claim self-contained and will revise the abstract accordingly.

Revision: yes

Referee: [Abstract, storage-level paragraph] The claim that the disk index is replaced 'without modifying the storage layer or read path' and that a read-only learned index is adapted to be block-aware for single-I/O lookups requires that the learned index expose exactly the same lookup and iterator interfaces (including error bounds, block metadata handling, and semantics) as the original block-based index. Any deviation would force read-path changes, undermining the 'minimal modification' premise and the attribution of the reported throughput gains.

Authors: Sections 4.1 and 4.3 of the manuscript specify that the block-aware adaptation of the read-only learned index (based on an off-the-shelf model) produces outputs that map directly to existing block boundaries and metadata formats, preserving identical lookup, iterator, error-bound, and semantic interfaces. The read path remains unchanged because the index is a drop-in replacement at the file level; no new error handling or metadata logic is introduced. This compatibility is what enables the reported gains to be attributed to index efficiency rather than interface changes.

Revision: no
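One way to read "block-aware" together with the fence-key modeling named in the Figure 4 caption is sketched below. It is an assumption-laden illustration, not the paper's design: the model is trained on the first key of each block (the fence keys) and predicts a block number, the residual search happens over the in-memory fence array, and exactly one block is then fetched. `BlockAwareIndex`, `find_block`, and the `block_reads` counter are hypothetical names, and a single linear model stands in for whatever learned index the authors adapt.

```python
import bisect


class BlockAwareIndex:
    """Toy block-aware learned index over fence keys (hypothetical sketch).

    Assumes a non-empty, sorted list of unique integer keys.
    """

    def __init__(self, sorted_keys, block_size):
        # Simulated on-disk layout: fixed-size blocks of sorted keys.
        self.blocks = [sorted_keys[i:i + block_size]
                       for i in range(0, len(sorted_keys), block_size)]
        self.fences = [block[0] for block in self.blocks]  # kept in memory
        first, last = self.fences[0], self.fences[-1]
        self.base = first
        self.slope = (len(self.fences) - 1) / (last - first) if last > first else 0.0
        # Error bound measured in blocks, not in key positions.
        self.eps = max(abs(self._predict(f) - i) for i, f in enumerate(self.fences))
        self.block_reads = 0  # counts simulated I/Os

    def _predict(self, key):
        return int(round(self.slope * (key - self.base)))

    def find_block(self, key):
        """Pick the only block that can hold key, touching in-memory data only."""
        n = len(self.fences)
        pred = self._predict(key)
        lo = max(0, min(n, pred - self.eps))
        hi = max(0, min(n, pred + self.eps + 1))
        # Last fence <= key; the monotone model keeps it inside the window.
        return max(0, bisect.bisect_right(self.fences, key, lo, hi) - 1)

    def _read_block(self, idx):
        self.block_reads += 1
        return self.blocks[idx]

    def get(self, key):
        block = self._read_block(self.find_block(key))  # the single simulated I/O
        i = bisect.bisect_left(block, key)
        return i < len(block) and block[i] == key
```

Because the model's error is resolved against fence keys held in memory, a misprediction costs extra comparisons rather than an extra block read. That is the property the single-I/O claim depends on, and in this sketch it holds for absent keys as well as present ones.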

Circularity Check

0 steps flagged

No circularity: empirical integration paper with no derivations or fitted predictions


The paper describes a systems engineering effort to integrate off-the-shelf learned indexes into RocksDB by exploiting existing Memtable/on-disk separation, adding a reuse mechanism, and adapting a read-only index for block awareness. No equations, parameter fitting, or predictive claims appear in the abstract or described approach; performance numbers (1.5X write, 2.1X read) are reported from experiments rather than derived from any model. The central claim rests on implementation details and benchmarking, not on any self-referential reduction, self-citation chain, or renaming of known results. This is a standard empirical contribution whose validity is assessed by reproduction of the reported throughput gains, not by internal logical closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are explicitly stated or derivable from the provided text.




[1] [1]

Abdullah Al-Mamun, Hao Wu, Qiyang He, Jianguo Wang, and Walid G. Aref

work page

[2] [2]

ACM Comput

A Survey of Learned Indexes for the Multi-dimensional Space. ACM Comput. Surv. (2025). https://doi.org/10.1145/3768575

work page doi:10.1145/3768575 2025

[3] [3]

Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan Gupta, Ravishankar Chand- hiramoorthi, and Diego Didona. 2019. SILK: Preventing latency spikes in Log- Structured merge Key-Value stores. InUSENIX Annual Technical Conference

work page 2019

[4] [4]

Oana Balmau, Rachid Guerraoui, Vasileios Trigonakis, and Igor Zablotchi. 2017. FloDB: Unlocking Memory in Persistent Key-Value Stores. InProceedings of the 12th European Conference on Computer Systems, EuroSys

work page 2017

[5] [5]

Zhichao Cao, Siying Dong, Sagar Vemuri, and David H. C. Du. 2020. Char- acterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook. In 18th USENIX Conference on File and Storage Technologies, FAST 2020, Santa Clara, CA, USA, February 24-27, 2020, Sam H. Noh and Brent Welch (Eds.). USENIX Association, 209–223. https://www.usenix.org/con...

work page 2020

[6] [6]

Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing

work page 2010

[7] [7]

Arpaci-Dusseau, and Remzi H

Yifan Dai, Yien Xu, Aishwarya Ganesan, Ramnatthan Alagappan, Brian Kroth, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2020. From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI). https: //www.usenix.org/conference/osdi20/presentation/dai

work page 2020

[8] [8]

Lomet, and Tim Kraska

Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David B. Lomet, and Tim Kraska. 2020. ALEX: An Updatable Adaptive Learned Index. In Proceedings of the International Conference on Management of Data (SIGMOD). https://doi.org/10.1145/3318464.3389711

work page doi:10.1145/3318464.3389711 2020

[9] [9]

Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, and Michael Strum. 2017. Optimizing Space Amplification in RocksDB. In 8th Biennial Conference on Innovative Data Systems Research (CIDR). https:// cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf

work page 2017

[10] Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. 2020. Why Are Learned Indexes So Effective? In Proceedings of the 37th International Conference on Machine Learning (ICML), Vol. 119. https://proceedings.mlr.press/v119/ferragina20a.html

[11] Paolo Ferragina and Giorgio Vinciguerra. 2020. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. Proc. VLDB Endow. 13, 8 (2020). https://doi.org/10.14778/3389133.3389135

[12] Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. FITing-Tree: A Data-aware Index Structure. In Proceedings of the International Conference on Management of Data (SIGMOD). https://doi.org/10.1145/3299869.3319860

[13] Jiake Ge, Boyu Shi, Yanfeng Chai, Yuanhui Luo, Yunda Guo, Yinxuan He, and Yunpeng Chai. 2023. Cutting Learned Index into Pieces: An In-depth Inquiry into Updatable Learned Indexes. In 39th IEEE International Conference on Data Engineering (ICDE). https://doi.org/10.1109/ICDE55515.2023.00031

[14] Sanjay Ghemawat, Jeff Dean, Chris Mumford, David Grogan, and Victor Costan. LevelDB. https://github.com/google/leveldb

[16] Alireza Heidari, Amirhossein Ahmadi, and Wei Zhang. 2025. DobLIX: A Dual-Objective Learned Index for Log-Structured Merge Trees. Proc. VLDB Endow. 18, 11 (2025). https://doi.org/10.14778/3749646.3749667

[17] Alireza Heidari, Amirhossein Ahmadi, and Wei Zhang. 2025. DobLIX: A Dual-Objective Learned Index for Log-Structured Merge Trees. https://github.com/ah89/DobLIX

[18] Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2019. SOSD: A Benchmark for Learned Indexes. NeurIPS Workshop on Machine Learning for Systems (2019). https://doi.org/10.48550/arXiv.1911.13014

[19] Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Jialin Ding, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2021. SageDB: A Learned Database System.

[20] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In Proceedings of the International Conference on Management of Data (SIGMOD). https://doi.org/10.1145/3183713.3196909

[21] Avinash Lakshman and Prashant Malik. 2010. Cassandra: a decentralized structured storage system. ACM SIGOPS Oper. Syst. Rev. 44, 2 (2010). https://doi.org/10.1145/1773912.1773922

[22] Hai Lan, Zhifeng Bao, J. Shane Culpepper, and Renata Borovica-Gajic. 2023. Updatable Learned Indexes Meet Disk-Resident DBMS - From Evaluations to Design Choices. Proc. ACM Manag. Data 1, 2 (2023). https://doi.org/10.1145/3589284

[23] Hai Lan, Zhifeng Bao, J. Shane Culpepper, Renata Borovica-Gajic, and Yu Dong. 2024. A Fully On-Disk Updatable Learned Index. In 40th IEEE International Conference on Data Engineering (ICDE). https://doi.org/10.1109/ICDE60146.2024.00369

[25] Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The adaptive radix tree: ARTful indexing for main-memory databases. In 29th IEEE International Conference on Data Engineering (ICDE). https://doi.org/10.1109/ICDE.2013.6544812

[26] Baotong Lu, Jialin Ding, Eric Lo, Umar Farooq Minhas, and Tianzheng Wang. 2021. APEX: A High-Performance Learned Index on Persistent Memory. Proc. VLDB Endow. 15, 3 (2021). https://www.vldb.org/pvldb/vol15/p597-lu.pdf

[28] Kai Lu. 2022. TridentKV: A Read-Optimized LSM-Tree Based KV Store via Adaptive Indexing and Space-Efficient Partitioning. https://github.com/emperorlu/Learned-RocksDB

[29] Kai Lu, Nannan Zhao, Jiguang Wan, Changhong Fei, Wei Zhao, and Tongliang Deng. 2022. TridentKV: A Read-Optimized LSM-Tree Based KV Store via Adaptive Indexing and Space-Efficient Partitioning. IEEE Trans. Parallel Distributed Syst. 33, 8 (2022). https://doi.org/10.1109/TPDS.2021.3118599

[30] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Hariharan Gopalakrishnan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2017. WiscKey: Separating Keys from Values in SSD-Conscious Storage. ACM Trans. Storage 13, 1 (2017). https://doi.org/10.1145/3033273

[31] Lailong Luo, Deke Guo, Richard T. B. Ma, Ori Rottenstreich, and Xueshan Luo. 2019. Optimizing Bloom Filter: Challenges, Solutions, and Comparisons. IEEE Commun. Surv. Tutorials 21, 2 (2019). https://doi.org/10.1109/COMST.2018.2889329

[33] Yuxuan Mo and Yu Hua. 2025. LOFT: A Lock-free and Adaptive Learned Index with High Scalability for Dynamic Workloads. In Proceedings of the 20th European Conference on Computer Systems, EuroSys. https://doi.org/10.1145/3689031.3717458

[34] Patrick E. O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth J. O’Neil. 1996. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica 33, 4 (1996). https://doi.org/10.1007/s002360050048

[35] William W. Pugh. 1990. Skip Lists: A Probabilistic Alternative to Balanced Trees. Commun. ACM 33, 6 (1990). https://doi.org/10.1145/78973.78977

[36] Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, and Ittai Abraham. 2017. PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP). https://doi.org/10.1145/3132747.3132765

[37] Subhadeep Sarkar, Dimitris Staratzis, Zichen Zhu, and Manos Athanassoulis. 2021. Constructing and Analyzing the LSM Compaction Design Space. Proc. VLDB Endow. 14, 11 (2021), 2216–2229. https://doi.org/10.14778/3476249.3476274

[38] Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, and Tim Kraska. 2021. Bounding the Last Mile: Efficient Learned String Indexing. CoRR abs/2111.14905 (2021). arXiv:2111.14905 https://arxiv.org/abs/2111.14905

[39] Speedb, Inc. 2022. Speedb: RocksDB-compatible high-performance storage engine. https://github.com/speedb-io/speedb

[40] Zhaoyan Sun, Xuanhe Zhou, and Guoliang Li. 2023. Learned index: A comprehensive experimental evaluation. Proc. VLDB Endow. 16, 8 (2023).

[41] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. 2020. CockroachDB: The Resilient Geo-Distributed SQL Database. In Proceedings of the 2020 International Conferen...

[42] Chuzhe Tang, Youyun Wang, Zhiyuan Dong, Gansen Hu, Zhaoguo Wang, Minjie Wang, and Haibo Chen. 2020. XIndex: a scalable learned index for multicore data storage. In PPoPP ’20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. https://doi.org/10.1145/3332466.3374547

[43] Youyun Wang, Chuzhe Tang, Zhaoguo Wang, and Haibo Chen. 2020. SIndex: a scalable learned index for string keys. In APSys ’20: 11th ACM SIGOPS Asia-Pacific Workshop on Systems, Tsukuba, Japan, August 24-25, 2020, Taesoo Kim and Patrick P. C. Lee (Eds.). ACM, 17–24. https://doi.org/10.1145/3409963.3410496

[44] Yi Wang, Jianan Yuan, Shangyu Wu, Huan Liu, Jiaxian Chen, Chenlin Ma, and Jianbin Qin. 2024. LeaderKV: Improving Read Performance of KV Stores via Learned Index and Decoupled KV Table. In 40th IEEE International Conference on Data Engineering (ICDE). https://doi.org/10.1109/ICDE60146.2024.00010

[45] Chaichon Wongkham, Baotong Lu, Chris Liu, Zhicong Zhong, Eric Lo, and Tianzheng Wang. 2022. Are Updatable Learned Indexes Ready? Proc. VLDB Endow. 15, 11 (2022). https://doi.org/10.14778/3551793.3551848

[46] Jiacheng Wu, Yong Zhang, Shimin Chen, Yu Chen, Jin Wang, and Chunxiao Xing. 2021. Updatable Learned Index with Precise Positions. Proc. VLDB Endow. 14, 8 (2021). https://doi.org/10.14778/3457390.3457393

[48] Qing Xie, Chaoyi Pang, Xiaofang Zhou, Xiangliang Zhang, and Ke Deng. 2014. Maximum error-bounded Piecewise Linear Representation for online stream approximation. VLDB J. 23, 6 (2014). https://doi.org/10.1007/s00778-014-0355-0

[49] Yifan Yang and Shimin Chen. 2024. LITS: An Optimized Learned Index for Strings. Proc. VLDB Endow. 17, 11 (2024), 3415–3427. https://doi.org/10.14778/3681954.3682010

[50] Jiaoyi Zhang, Kai Su, and Huanchen Zhang. 2024. Making In-Memory Learned Indexes Efficient on Disk. Proc. ACM Manag. Data 2, 3 (2024). https://doi.org/10.1145/3654954

[51] Yong Zhang, Xinran Xiong, and Oana Balmau. 2022. TONE: cutting tail-latency in learned indexes. In CHEOPS@EuroSys: Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems. https://doi.org/10.1145/3503646.3524295

[52] Weihong Zhou and Shiyu Yang. 2024. SLIPP: A Space-Efficient Learned Index for String Keys. In Proceedings of the 6th International Conference on Big-data Service and Intelligent Computation, BDSIC 2024, Hong Kong, Hong Kong, May 29-31, 2024. ACM, 69–77. https://doi.org/10.1145/3686540.3686550