pith. sign in

arxiv: 2605.29061 · v1 · pith:IKR7BSM2new · submitted 2026-05-27 · 💻 cs.DS · cs.DB

Residual-Entropy Accounting for Routed Atom-Budgeted Learned Indexes

Pith reviewed 2026-06-29 09:03 UTC · model grok-4.3

classification 💻 cs.DS cs.DB
keywords learned indexesresidual entropyaccounting theorempredecessor searchrank searchcertified repairatom budgetpiecewise linear
0
0 comments X

The pith

Conditional residual answer entropy determines the query-time scale in routed atom-budgeted learned indexes with certified repair.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines exact predecessor and rank search inside a routed architecture where a directory sends each query to a contiguous interval, a counted local predictor supplies a certified rank window, and exact repair finishes the answer by comparisons. It introduces conditional residual answer entropy as the uncertainty left in the exact answer once the leaf, predictor output, certificate, and charged pre-repair information are known. The central result is a two-sided accounting theorem that equates this entropy to the scale of query time under the given local predictor-atom budget. Directory space, sorted-array storage, and transcript-indexed repair-program space are kept as separate costs, so the theorem supplies an accounting relation rather than a single byte-level bound. The work also derives a rank-spread specialization and reports concrete RGapM and GapM values plus benchmarks for counted piecewise-linear segments on real data sets.

Core claim

We prove a two-sided accounting theorem showing that conditional residual answer entropy gives the query-time scale under the stated routed, atom-budgeted, certified-repair architecture and local predictor-atom budget, with directory space, sorted-array storage, and transcript-indexed repair-program space treated as separate system costs.

What carries the argument

The two-sided accounting theorem that equates conditional residual answer entropy to query-time scale.

If this is right

  • Query time is governed by the entropy remaining after the predictor transcript and certificate are observed.
  • For counted piecewise-linear segments the profile term becomes non-oracular through a shadow-price allocation rule.
  • Finite-instance RGapM and GapM values can be computed on real SOSD and Zenodo samples.
  • Benchmarks against PGM-index, RadixSpline, and binary search reveal overheads and bottlenecks rather than overall speed claims.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The entropy accounting could guide atom-budget allocation across multiple predictor types if the certificate and repair structure stays fixed.
  • Treating repair-program space as a distinct cost allows explicit trade-offs between precomputed repair transcripts and on-line comparisons.
  • The rank-spread specialization may apply when residual ranks after the transcript remain uniformly likely.

Load-bearing premise

The result holds only inside the architecture that routes via directory to contiguous intervals, uses counted local predictors for certified rank windows, resolves the rest by exact repair, and keeps the three space costs separate.

What would settle it

An implementation of the routed atom-budgeted certified-repair architecture in which the measured number of repair comparisons deviates systematically from the computed conditional residual answer entropy for the same atom budget.

read the original abstract

We study exact predecessor and rank search in a routed, atom-budgeted, certified-repair learned-index architecture. An ordered directory routes each query to a contiguous interval, a counted local predictor returns a certified rank window, and exact repair resolves the remaining uncertainty by comparisons. The result is scoped to this architecture and does not claim guarantees for arbitrary learned-index designs such as unconstrained RMI dispatch, hash routing, neural routing, or exact-payload indexes without additional accounting. The main parameter is conditional residual answer entropy: the entropy of the exact answer after the leaf, predictor output, certificate, and charged pre-repair information are observed. We prove a two-sided accounting theorem showing that this functional gives the query-time scale under the stated architecture and local predictor-atom budget. Directory space, sorted-array storage, and transcript-indexed repair-program space are treated as separate system costs, so the theorem is not a byte-level space lower bound or a compact implementation recipe. We also give a rank-spread specialization in which the radius term log(1 + Delta) is valid only when many residual ranks remain likely after the predictor transcript is known. For counted piecewise-linear segments, we make the profile term non-oracular, derive a shadow-price allocation rule, compute finite-instance RGapM and GapM values on real SOSD and Zenodo samples, and report benchmarks against PGM-index, RadixSpline, and binary search. The benchmarks expose overheads and bottlenecks rather than claiming speed for the shadow prototype.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript studies exact predecessor and rank search in a routed, atom-budgeted, certified-repair learned-index architecture consisting of an ordered directory, counted local predictor returning a certified rank window, and exact repair. The central claim is a two-sided accounting theorem equating conditional residual answer entropy (entropy of the exact answer after observing the leaf, predictor output, certificate, and charged pre-repair information) to query-time scale under the local predictor-atom budget. Directory, sorted-array, and transcript-indexed repair-program space are treated as separate costs. Additional contributions include a rank-spread specialization (with radius term log(1 + Delta) valid only when many residual ranks remain likely), a shadow-price allocation rule making the profile term non-oracular for counted piecewise-linear segments, finite-instance RGapM and GapM computations on SOSD and Zenodo samples, and benchmarks against PGM-index, RadixSpline, and binary search that expose overheads rather than claiming superiority.

Significance. If the scoped theorem holds, it supplies a precise, architecture-specific identity relating residual entropy to query time while cleanly separating distinct space costs; this is a useful theoretical tool for analyzing this class of learned indexes. The finite-instance empirical values and benchmark exposure of bottlenecks provide concrete data points on practical overheads within the stated model.

major comments (2)
  1. [Abstract / empirical evaluation] Abstract and empirical section: The reported finite-instance RGapM and GapM values on SOSD and Zenodo samples are presented without error bars, confidence intervals, sample-size justification, or explicit exclusion criteria; because these values are used to illustrate the accounting in concrete cases, the absence of statistical characterization weakens the empirical support for the theorem's practical relevance.
  2. [Theorem on residual-entropy accounting] Theorem statement (two-sided accounting): The derivation that conditional residual answer entropy exactly equals query-time scale must be shown to hold without introducing architecture-dependent fitted parameters beyond those already declared; if the proof relies on the counted local predictor and certified rank window, the steps establishing the two-sided bound should be expanded to confirm no circular reduction to the functional itself occurs.
minor comments (3)
  1. [Rank-spread specialization] The rank-spread specialization should include a precise, quantitative definition of the phrase 'many residual ranks remain likely' to make the validity condition for log(1 + Delta) operational rather than informal.
  2. [Shadow-price rule] The shadow-price allocation rule for piecewise-linear segments is presented as making the profile term non-oracular; a short worked example on a small instance would clarify how the rule is applied in the finite-instance computations.
  3. [Benchmarks] The benchmarks section would benefit from an explicit statement of the machine model and cost model used for the reported query times to ensure reproducibility of the overhead comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and the recommendation of minor revision. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract / empirical evaluation] Abstract and empirical section: The reported finite-instance RGapM and GapM values on SOSD and Zenodo samples are presented without error bars, confidence intervals, sample-size justification, or explicit exclusion criteria; because these values are used to illustrate the accounting in concrete cases, the absence of statistical characterization weakens the empirical support for the theorem's practical relevance.

    Authors: We agree that the empirical section would benefit from statistical characterization. The reported values were provided as concrete illustrations of the accounting theorem on real data rather than as a comprehensive statistical evaluation. In the revised manuscript we will add bootstrap-derived 95% confidence intervals for the RGapM and GapM figures, state the exact sample sizes drawn from each dataset, and document any exclusion criteria applied. revision: yes

  2. Referee: [Theorem on residual-entropy accounting] Theorem statement (two-sided accounting): The derivation that conditional residual answer entropy exactly equals query-time scale must be shown to hold without introducing architecture-dependent fitted parameters beyond those already declared; if the proof relies on the counted local predictor and certified rank window, the steps establishing the two-sided bound should be expanded to confirm no circular reduction to the functional itself occurs.

    Authors: The two-sided accounting is derived directly from the definition of conditional residual answer entropy together with the atom-budget constraint on the counted local predictor and the certified rank window; no additional fitted parameters are introduced. We nevertheless accept that the proof presentation can be strengthened. In the revision we will expand the derivation steps (in the main text or a dedicated appendix) to make explicit the sequence from the predictor transcript and certificate to the entropy bound, confirming that the argument does not reduce circularly to the functional being bounded. revision: yes

Circularity Check

0 steps flagged

No significant circularity; architecture-scoped identity is self-contained

full rationale

The central result is a two-sided accounting theorem scoped explicitly to the routed directory + counted local predictor + certified rank window + exact repair model, with directory/array/repair-program space treated as separate costs. The theorem equates the defined conditional residual answer entropy functional (entropy of exact answer after observing leaf, predictor output, certificate, and charged pre-repair information) to query-time scale under the local predictor-atom budget. No equations, self-citations, or derivations are exhibited in the provided material that reduce this identity to a fitted parameter renamed as prediction, a self-definitional loop, or a load-bearing self-citation chain. The rank-spread specialization and shadow-price rule are presented as further derivations inside the same scoped model. The result is therefore architecture-dependent by explicit statement rather than circular by construction. Benchmarks against external indexes are reported without claiming the theorem itself depends on them.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Central claim rests on the definition of conditional residual answer entropy and the architectural constraints listed in the abstract; no free parameters are fitted to data in the abstract description, and no new entities are postulated.

free parameters (1)
  • conditional residual answer entropy
    Defined as entropy of exact answer after leaf, predictor output, certificate, and pre-repair information; serves as the main parameter for query-time scale.
axioms (1)
  • domain assumption Entropy functional is an appropriate measure of residual uncertainty for query cost
    Invoked as the quantity that gives the query-time scale under the architecture.

pith-pipeline@v0.9.1-grok · 5799 in / 1254 out tokens · 35986 ms · 2026-06-29T09:03:11.720969+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 21 canonical work pages

  1. [1]

    Peyman Afshani, Jérémie Barbay, and Timothy M. Chan. Instance-optimal geometric algorithms.Journal of the ACM, 64(1):4:1–4:38, 2017

  2. [2]

    A lower bound for finding predecessors in Yao’s cell probe model.Combina- torica, 8(3):235–247, 1988

    Miklós Ajtai. A lower bound for finding predecessors in Yao’s cell probe model.Combina- torica, 8(3):235–247, 1988

  3. [3]

    BF-Tree: Approximate tree indexing

    Manos Athanassoulis and Anastasia Ailamaki. BF-Tree: Approximate tree indexing. Proceedings of the VLDB Endowment, 7(14):1881–1892, 2014

  4. [4]

    Paul Beame and Faith E. Fich. Optimal bounds for the predecessor problem and related problems.Journal of Computer and System Sciences, 65(1):38–72, 2002

  5. [5]

    Bent, Daniel D

    Stephen W. Bent, Daniel D. Sleator, and Robert E. Tarjan. Biased search trees.SIAM Journal on Computing, 14(3):545–568, 1985

  6. [6]

    A learned approach to design compressed rank/select data structures.ACM Transactions on Algorithms, 18(3):24:1–24:28, 2022

    Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. A learned approach to design compressed rank/select data structures.ACM Transactions on Algorithms, 18(3):24:1–24:28, 2022

  7. [7]

    Beyond logarithmic bounds: querying in constant expected time with learned indexes

    Luis Alberto Croquevielle, Guang Yang, Liang Liang, Ali Hadian, and Thomas Heinis. Beyond logarithmic bounds: querying in constant expected time with learned indexes. In 28th International Conference on Database Theory (ICDT 2025), LIPIcs 328, 19:1–19:21,

  8. [8]

    doi: 10.4230/LIPIcs.ICDT.2025.19

  9. [9]

    Lower bounds for the algorithmic complexity of learned indexes

    Luis Alberto Croquevielle, Roman Sokolovskii, and Thomas Heinis. Lower bounds for the algorithmic complexity of learned indexes. In29th International Conference on Database Theory (ICDT 2026), LIPIcs 365, 14:1–14:21, 2026. doi: 10.4230/LIPIcs.ICDT.2026.14

  10. [10]

    Demaine, Thouis Jones, and Mihai Pătraşcu

    Erik D. Demaine, Thouis Jones, and Mihai Pătraşcu. Interpolation search for non- independent data. InSODA, pages 522–523, 2004

  11. [11]

    Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, Friedhelm Meyer auf der Heide, Hans Rohnert, and Robert E. Tarjan. Dynamic perfect hashing: upper and lower bounds.SIAM Journal on Computing, 23(4):738–761, 1994

  12. [12]

    ALEX: An updatable adaptive learned index

    Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, and Tim Kraska. ALEX: An updatable adaptive learned index. InProceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 969–984, 2020. doi: 10.1145/3318464.3389711

  13. [13]

    Why are learned indexes so effective? InICML, PMLR 119, pages 3123–3132, 2020

    Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. Why are learned indexes so effective? InICML, PMLR 119, pages 3123–3132, 2020. 42

  14. [14]

    The PGM-index: A fully-dynamic compressed learned index with provable worst-case bounds.Proceedings of the VLDB Endowment, 13(8):1162–1175, 2020

    Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: A fully-dynamic compressed learned index with provable worst-case bounds.Proceedings of the VLDB Endowment, 13(8):1162–1175, 2020. doi: 10.14778/3389133.3389135

  15. [15]

    FL-RMQ: A learned approach to range minimum queries

    Paolo Ferragina and Filippo Lari. FL-RMQ: A learned approach to range minimum queries. InCPM 2025, LIPIcs 331, 7:1–7:23, 2025. doi: 10.4230/LIPIcs.CPM.2025.7

  16. [16]

    Fredman, János Komlós, and Endre Szemerédi

    Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table withO(1) worst case access time.Journal of the ACM, 31(3):538–544, 1984

  17. [17]

    Fredman and Michael E

    Michael L. Fredman and Michael E. Saks. The cell probe complexity of dynamic data structures. InProceedings of the 21st Annual ACM Symposium on Theory of Computing (STOC), pages 345–354, 1989. doi: 10.1145/73007.73040

  18. [18]

    A dynamic piecewise-linear geometric index with worst-case guarantees

    Emil Toftegaard Gæde, Ivor van der Hoog, Eva Rotenberg, and Tord Stordalen. A dynamic piecewise-linear geometric index with worst-case guarantees. In33rd Annual European Symposium on Algorithms (ESA 2025), LIPIcs 351, 64:1–64:18, 2025. doi: 10.4230/LIPIcs.ESA.2025.64

  19. [19]

    FITing-Tree: A data-aware index structure

    Alexandros Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, Tim Kraska, and Umar Farooq Minhas. FITing-Tree: A data-aware index structure. InProceedings of the 2019 ACM SIGMOD International Conference on Management of Data, pages 1189–1206,

  20. [20]

    doi: 10.1145/3299869.3319860

  21. [21]

    Computational-geometric methods for polygonal approximations of a curve.Computer Vision, Graphics, and Image Processing, 36(1):31–41, 1986

    Hiroshi Imai and Masao Iri. Computational-geometric methods for polygonal approximations of a curve.Computer Vision, Graphics, and Image Processing, 36(1):31–41, 1986. doi: 10.1016/S0734-189X(86)80027-5

  22. [22]

    PhD thesis, Carnegie Mellon University, 1989

    Guy Jacobson.Succinct Static Data Structures. PhD thesis, Carnegie Mellon University, 1989

  23. [23]

    RadixSpline: A single-pass learned index

    Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. RadixSpline: A single-pass learned index. InProceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM), pages 5:1–5:5, 2020. doi: 10.1145/3401071.3401659

  24. [24]

    Donald E. Knuth. Optimum binary search trees.Acta Informatica, 1:14–25, 1971

  25. [25]

    Chi, Jeffrey Dean, and Neoklis Polyzotis

    Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. InProceedings of the 2018 International Conference on Management of Data (SIGMOD), pages 489–504, 2018. doi: 10.1145/3183713.3196909

  26. [26]

    Shane Culpepper, and Renata Borovica-Gajic

    Hai Lan, Zhifeng Bao, J. Shane Culpepper, and Renata Borovica-Gajic. Updatable learned indexes meet disk-resident DBMS: from evaluations to design choices. InProceedings of the ACM SIGMOD International Conference on Management of Data, 2023. doi: 10.1145/3589284

  27. [27]

    The adaptive radix tree: ARTful indexing for main-memory databases

    Viktor Leis, Alfons Kemper, and Thomas Neumann. The adaptive radix tree: ARTful indexing for main-memory databases. InICDE, pages 38–49, 2013

  28. [28]

    Why are learned indexes so effective but sometimes ineffective?Proceedings of the VLDB Endowment, 18(9):2886–2898, 2025

    Qiyu Liu, Siyuan Han, Yanlin Qi, and others. Why are learned indexes so effective but sometimes ineffective?Proceedings of the VLDB Endowment, 18(9):2886–2898, 2025. doi: 10.14778/3746405.3746415

  29. [29]

    Understanding robustness issues of updatable learned indexes: experiments and analysis.Proceedings of the ACM on Management of Data, 3(4):270, 2025

    Yuanhui Luo, Minhui Xie, Yiheng Tong, Shichao Jiang, and Yunpeng Chai. Understanding robustness issues of updatable learned indexes: experiments and analysis.Proceedings of the ACM on Management of Data, 3(4):270, 2025. doi: 10.1145/3749188. 43

  30. [30]

    A critical analysis of recursive model indexes.Proceedings of the VLDB Endowment, 15(5):1079–1091, 2022

    Marcel Maltry and Jens Dittrich. A critical analysis of recursive model indexes.Proceedings of the VLDB Endowment, 15(5):1079–1091, 2022

  31. [31]

    CDFShop: Exploring and optimizing learned index structures

    Ryan Marcus, Emily Zhang, and Tim Kraska. CDFShop: Exploring and optimizing learned index structures. InProceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 2789–2792, 2020. doi: 10.1145/3318464.3384706

  32. [32]

    Benchmarking learned indexes.Proceedings of the VLDB Endowment, 14(1):1–13, 2021

    Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, and Tim Kraska. Benchmarking learned indexes.Proceedings of the VLDB Endowment, 14(1):1–13, 2021. doi: 10.14778/3421424.3421425

  33. [33]

    Nearly optimal binary search trees.Acta Informatica, 5:287–295, 1975

    Kurt Mehlhorn. Nearly optimal binary search trees.Acta Informatica, 5:287–295, 1975

  34. [34]

    Cell probe complexity – a survey

    Peter Bro Miltersen. Cell probe complexity – a survey. InAdvances in Data Structures Workshop, 1999

  35. [35]

    Predecessor search.ACM Computing Surveys, 53(5):105:1–105:35, 2020

    Gonzalo Navarro and Matías Rojas-Ledesma. Predecessor search.ACM Computing Surveys, 53(5):105:1–105:35, 2020

  36. [36]

    Overmars.The Design of Dynamic Data Structures

    Mark H. Overmars.The Design of Dynamic Data Structures. Lecture Notes in Computer Science 156. Springer, 1983

  37. [37]

    Mihai Pătraşcu and Erik D. Demaine. Logarithmic lower bounds in the cell-probe model. SIAM Journal on Computing, 35(4):932–963, 2006

  38. [38]

    Time-space trade-offs for predecessor search

    Mihai Pătraşcu and Mikkel Thorup. Time-space trade-offs for predecessor search. InSTOC, pages 232–240, 2006

  39. [39]

    Randomization does not help searching predecessors

    Mihai Pătraşcu and Mikkel Thorup. Randomization does not help searching predecessors. InSODA, pages 555–564, 2007

  40. [40]

    Cell-probe lower bounds for succinct partial sums

    Mihai Pătraşcu and Emanuele Viola. Cell-probe lower bounds for succinct partial sums. In SODA, pages 117–122, 2010

  41. [41]

    Srinivasa Rao

    Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encodingk-ary trees and multisets.ACM Transactions on Algorithms, 3(4):43, 2007

  42. [42]

    Jun Rao and Kenneth A. Ross. Making B+-trees cache conscious in main memory. In SIGMOD, pages 475–486, 2000

  43. [43]

    Sleator and Robert E

    Daniel D. Sleator and Robert E. Tarjan. Self-adjusting binary search trees.Journal of the ACM, 32(3):652–686, 1985

  44. [44]

    Zhang, J

    Lorenzo Bellomo. Sorted unsigned integer datasets. Zenodo, 2025. doi: 10.5281/zen- odo.15240501

  45. [45]

    Kaas, and E

    Peter van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient priority queue.Mathematical Systems Theory, 10:99–127, 1977

  46. [46]

    Are updatable learned indexes ready?Proceedings of the VLDB Endowment, 15(11):3004–3017, 2022

    Chaichon Wongkham, Baotong Lu, Chris Liu, Zhicong Zhong, Eric Lo, and Tianzheng Wang. Are updatable learned indexes ready?Proceedings of the VLDB Endowment, 15(11):3004–3017, 2022. doi: 10.14778/3551793.3551848

  47. [47]

    Updatable learned index with precise positions.Proceedings of the VLDB Endowment, 14(8):1276–1288,

    JiachengWu, YongZhang, ShiminChen, JinWang, YuChen, andChunxiaoXing. Updatable learned index with precise positions.Proceedings of the VLDB Endowment, 14(8):1276–1288,

  48. [48]

    doi: 10.14778/3457390.3457393. 44

  49. [49]

    Kornaropoulos, and Yue Cheng

    Rui Yang, Evgenios M. Kornaropoulos, and Yue Cheng. Algorithmic complexity attacks on dynamic learned indexes.Proceedings of the VLDB Endowment, 17(4):780–792, 2024. doi: 10.14778/3636218.3636232

  50. [50]

    Andrew C.-C. Yao. Probabilistic computations: toward a unified measure of complexity. In FOCS, pages 222–227, 1977

  51. [51]

    On distribution dependent sub-logarithmic query time of learned indexing

    Sepanta Zeighami and Cyrus Shahabi. On distribution dependent sub-logarithmic query time of learned indexing. InICML, PMLR 202, pages 40669–40680, 2023

  52. [52]

    Making in-memory learned indexes efficient on disk.Proceedings of the ACM on Management of Data, 2(3):151:1–151:27, 2024

    Jiaoyi Zhang, Kai Su, and Huanchen Zhang. Making in-memory learned indexes efficient on disk.Proceedings of the ACM on Management of Data, 2(3):151:1–151:27, 2024. doi: 10.1145/3654954. 45