pith. sign in

arxiv: 2510.27588 · v3 · pith:APVXVHWLnew · submitted 2025-10-31 · 💻 cs.DS · cs.DB· cs.LG

Learned Static Function Data Structures

Pith reviewed 2026-05-21 19:53 UTC · model grok-4.3

classification 💻 cs.DS cs.DBcs.LG
keywords static functionslearned data structuresprefix codeszero-order entropymachine learningdata compressionpoint querieskey-value associations
0
0 comments X

The pith

Learned static functions use machine learning to predict per-key value probabilities and build shorter prefix codes, breaking the zero-order entropy barrier for point queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces learned static functions for associating a fixed set of keys with values without storing the keys themselves. Traditional static functions are limited by the zero-order entropy of the value sequence, but here a machine learning model first predicts a probability distribution over possible values for each key. From that distribution the method derives a custom prefix code tailored to the key, then stores the resulting short codeword inside an ordinary static function. If the model finds real correlations between keys and values, the average code length falls below what any single global code could achieve. Experiments report space reductions of up to an order of magnitude on real data sets and three orders of magnitude on synthetic data, all while still answering point queries in constant time.

Core claim

The central claim is that a machine-learning model can capture non-trivial correlations between keys and values, allowing the construction of a key-specific prefix code whose expected length is shorter than the zero-order entropy of the value sequence; these compact codewords are then stored in a conventional static function data structure that supports point queries without materializing the key set.

What carries the argument

Per-key prefix code derived from a machine-learning model's predicted probability distribution over values, stored inside a classic static function.

If this is right

  • Space usage can fall below the zero-order entropy bound while point queries remain supported.
  • Savings scale with the strength of correlations the model can learn.
  • The approach works for any static function that can store the derived codewords.
  • Real-world data sets with natural key-value dependencies yield up to 10x compression; synthetic data with engineered correlations yield up to 1000x compression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prediction-plus-prefix-code idea could be layered on top of existing compressed static functions to obtain further gains.
  • If the model is cheap to evaluate, the method might extend to semi-static or slowly changing data sets without rebuilding from scratch.
  • Applications that already use static functions for dictionaries or Bloom-filter alternatives could adopt this technique to reduce memory footprint.

Load-bearing premise

The machine learning model must capture non-trivial correlations between keys and values so that the per-key prefix codes are shorter on average than any single global code.

What would settle it

Run the construction on a data set in which keys and values are generated independently; if measured space usage remains strictly above the zero-order entropy of the values, the claimed improvement does not occur.

Figures

Figures reproduced from arXiv: 2510.27588 by Giorgio Vinciguerra, Hans-Peter Lehmann, Stefan Hermann, Stefan Walzer.

Figure 1
Figure 1. Figure 1: Architecture of learned static functions. In gray, [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Structure of the BuRR equation system [26]. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: A VL-BuRR data structure using ℓmax separate 1-bit SFs, given by column vectors 𝒁0, . . . , 𝒁ℓmax−1 (here ℓmax = 6). Shaded areas are two examples of the bits accessed by a single query. Grey dots indicate individual bits, dotted lines indicate how the bits are grouped into words of length𝒘, and numbers indicate the order in which these words are stored. ℓmax instances of 1-bit BuRR, where ℓmax is the leng… view at source ↗
Figure 4
Figure 4. Figure 4: Idealised space overhead of filter-based WRM for [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Query time, inference time (dashed) and space over [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
read the original abstract

We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence. In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces learned static functions for static key-value associations that support point queries. An ML model predicts a per-key distribution over values; a key-specific prefix code is derived from this distribution and the codeword is stored in a standard static-function data structure. The construction is claimed to break the zero-order entropy barrier of the value sequence when the model captures non-trivial key-value correlations. Experiments report concrete space savings of up to 10× on real data and up to three orders of magnitude on synthetic data.

Significance. If the central claim holds, the result is significant for the data-structures community: it supplies a concrete, implementable method that integrates standard ML training with existing static-function primitives to obtain compression strictly below H0 while preserving O(1) queries. The approach is not circular (savings are measured against external baselines) and the reported ratios on both real and synthetic instances provide direct, falsifiable evidence that the inequality can be realized in practice.

major comments (2)
  1. [Experiments] Experiments section: the total-space figures must explicitly include the size of the trained ML model (plus any auxiliary structures) in addition to the static-function storage of the codewords; without this accounting the claim that the construction breaks the zero-order entropy barrier cannot be verified.
  2. [§4] §4 (or equivalent experimental subsection): the paper should state the experimental protocol for instance selection and report whether the quoted savings ratios come from a pre-specified set of runs or were obtained after observing the data; inclusion of standard error bars or multiple independent trials would directly address this.
minor comments (2)
  1. [Abstract] Abstract: the phrases 'up to one order of magnitude' and 'up to three orders of magnitude' would be clearer if the paper distinguished maximum observed savings from average savings across the test suite.
  2. [Construction] Notation: the manuscript should adopt a uniform symbol for the learned distribution (e.g., p̂_v|k) and for the resulting code length to avoid ambiguity when comparing against H0.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the experimental reporting. We address each major comment below and will incorporate the requested clarifications in a revised version of the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the total-space figures must explicitly include the size of the trained ML model (plus any auxiliary structures) in addition to the static-function storage of the codewords; without this accounting the claim that the construction breaks the zero-order entropy barrier cannot be verified.

    Authors: We agree that total-space figures must account for the trained ML model and auxiliary structures to allow verification of the entropy-barrier claim. In the revised manuscript we will add explicit reporting of model sizes (and any auxiliary data) alongside the static-function storage costs, and we will recompute and present the space savings ratios using these complete figures. revision: yes

  2. Referee: [§4] §4 (or equivalent experimental subsection): the paper should state the experimental protocol for instance selection and report whether the quoted savings ratios come from a pre-specified set of runs or were obtained after observing the data; inclusion of standard error bars or multiple independent trials would directly address this.

    Authors: We will expand §4 to describe the instance-selection protocol in detail and to state whether the reported ratios derive from a pre-specified set of runs. Where feasible we will also add results from multiple independent trials together with standard error bars. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's construction trains an ML model on the key-value pairs to output per-key value distributions, derives Huffman-style prefix codes from those distributions, and stores the resulting codewords inside an off-the-shelf static-function data structure whose space is bounded by the sum of the code lengths plus the model size. The claimed improvement below zero-order entropy H0 follows directly from the information-theoretic property that a code whose lengths are -log p_i (where p_i is the model-predicted probability) is shorter on average precisely when the model assigns higher probability to the true value than the uniform or global distribution; this inequality is measured on held-out or synthetic instances against external baselines and does not reduce to a fitted parameter being renamed as a prediction. No load-bearing step invokes a self-citation whose content is itself unverified, nor does any equation equate a derived quantity to its own input by definition. The experiments therefore constitute independent evidence rather than a tautology.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence of learnable key-value correlations and on the correctness of the underlying static function primitive; no new mathematical axioms are introduced.

free parameters (1)
  • ML model architecture and hyperparameters
    Choice of model family and training settings are selected to fit the observed key-value correlations.

pith-pipeline@v0.9.0 · 5698 in / 1047 out tokens · 99866 ms · 2026-05-21T19:53:25.129141+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages

  1. [1]

    The Gene Ontology resource: enriching a GOld mine.Nucleic acids research 49, D1 (2021), D325–D334

    2021. The Gene Ontology resource: enriching a GOld mine.Nucleic acids research 49, D1 (2021), D325–D334

  2. [2]

    Stephen Alstrup, Gerth Brodal, and Theis Rauhe. 2001. Optimal static range reporting in one dimension. InProc. 33rd Annual ACM Symposium on Theory of Computing (STOC). 476–482

  3. [3]

    Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and Sheng Lin

    Naiyong Ao, Fan Zhang, Di Wu, Douglas S. Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and Sheng Lin. 2011. Efficient Parallel Lists Intersection and Index Compression Algorithms Using Graphics Processing Units.PVLDB4, 8 (2011), 470–481. https://doi.org/10.14778/2002974.2002975 12

  4. [4]

    Sepehr Assadi, Martin Farach-Colton, and William Kuszmaul. 2023. Tight Bounds for Monotone Minimal Perfect Hashing. InProc. 34th Annual ACM- SIAM Symposium on Discrete Algorithms (SODA). 456–476. https://doi.org/10. 1137/1.9781611977554.CH20

  5. [5]

    Garofalakis, and Rajeev Rastogi

    Shivnath Babu, Minos N. Garofalakis, and Rajeev Rastogi. 2001. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. InProc. 28th ACM International Conference on Management of Data (SIGMOD). 283–294. https://doi.org/10.1145/375663.375693

  6. [6]

    Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. 2010. Fast prefix search in little space, with applications. InProc. 18th Annual European Symposium on Algorithms (ESA). 427–438

  7. [7]

    Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. 2011. The- ory and practice of monotone minimal perfect hashing.ACM J. Exp. Algorithmics 16 (2011). https://doi.org/10.1145/1963190.2025378

  8. [8]

    Djamal Belazzougui and Rossano Venturini. 2013. Compressed static functions with applications. InProc. 24th Annual ACM-SIAM Symposium on Discrete Algo- rithms (SODA). SIAM, 229–240. https://doi.org/10.1137/1.9781611973105.17

  9. [9]

    Bender, Alex Conway, Martin Farach-Colton, William Kuszmaul, and Guido Tagliavini

    Michael A. Bender, Alex Conway, Martin Farach-Colton, William Kuszmaul, and Guido Tagliavini. 2023. Iceberg Hashing: Optimizing Many Hash-Table Criteria at Once.J. ACM70, 6 (2023), 40:1–40:51. https://doi.org/10.1145/3625817

  10. [10]

    Bercea, Jakob Bæk Tejs Houen, and Rasmus Pagh

    Ioana O. Bercea, Jakob Bæk Tejs Houen, and Rasmus Pagh. 2024. Daisy Bloom Filters. InProc. 19th Scandinavian Symposium and Workshops on Algorithm Theory (SW AT). 9:1–9:19. https://doi.org/10.4230/LIPICS.SWAT.2024.9

  11. [11]

    Jock Blackard. 1998. Covertype. UCI Machine Learning Repository. https: //doi.org/10.24432/C50K5N

  12. [12]

    Burton H. Bloom. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors.Commun. ACM13, 7 (1970), 422–426. https://doi.org/10.1145/362686. 362692

  13. [13]

    Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. 2022. A Learned Approach to Design Compressed Rank/Select Data Structures.ACM Trans. on Algorithms18, 3, Article 24 (oct 2022), 28 pages. https://doi.org/10.1145/3524060

  14. [14]

    Botelho, Yoshiharu Kohayakawa, and Nivio Ziviani

    Fabiano C. Botelho, Yoshiharu Kohayakawa, and Nivio Ziviani. 2005. A Prac- tical Minimal Perfect Hashing Method. InProc. 4th International Workshop on Experimental and Efficient Algorithms (WEA). 488–500. https://doi.org/10.1007/ 11427186_42

  15. [15]

    Botelho, Rasmus Pagh, and Nivio Ziviani

    Fabiano C. Botelho, Rasmus Pagh, and Nivio Ziviani. 2013. Practical Perfect Hashing in Nearly Optimal Space.Inf. Syst.38, 1 (2013), 108–131. https://doi. org/10.1016/j.is.2012.06.002

  16. [16]

    Jehoshua Bruck, Jie Gao, and Anxiao Jiang. 2006. Weighted Bloom filter. In Proc. 39th IEEE International Symposium on Information Theory (ISIT). 2304–2308. https://doi.org/10.1109/ISIT.2006.261978

  17. [17]

    Hayoung Byun and Hyesook Lim. 2022. Learned FBF: Learning-Based Functional Bloom Filter for Key-Value Storage.IEEE Trans. Computers71, 8 (2022), 1928–1938. https://doi.org/10.1109/TC.2021.3112079

  18. [18]

    Benjamin Coleman, David Torres Ramos, Vihan Lakshman, Chen Luo, and An- shumali Shrivastava. 2023. CARAMEL: A Succinct Read-Only Lookup Table via Compressed Static Functions. (2023). https://doi.org/10.48550/arXiv.2305.16545

  19. [19]

    Colin Cooper. 2000. On the rank of random matrices.Random Struct. Algorithms 16, 2 (2000), 209–232. https://doi.org/10.1002/(SICI)1098-2418(200003)16:2% 3C209::AID-RSA6%3E3.0.CO;2-1

  20. [20]

    Zhenwei Dai and Anshumali Shrivastava. 2020. Adaptive Learned Bloom Filter (Ada-BF): Efficient Utilization of the Classifier with Application to Real-Time Information Filtering on the Web. InProc. 33rd Annual Conference on Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/ 2020/hash/86b94dae7c6517ec1ac767fd2c136580-Abst...

  21. [21]

    Zhenwei Dai, Anshumali Shrivastava, Pedro Reviriego, and José Alberto Hernán- dez. 2022. Optimizing Learned Bloom Filters: How Much Should Be Learned? IEEE Embed. Syst. Lett.14, 3 (2022), 123–126. https://doi.org/10.1109/LES.2022. 3156019

  22. [22]

    Scott Davies and Andrew W. Moore. 1999. Bayesian Networks for Lossless Dataset Compression. InProc. 5th ACM International Conference on Knowledge Discovery and Data Mining, (KDD). 387–391. https://doi.org/10.1145/312129. 312289

  23. [23]

    1981.Assessing Probability Assessors: Calibration and Refinement.Technical Report

    Morris H DeGroot and Stephen E Fienberg. 1981.Assessing Probability Assessors: Calibration and Refinement.Technical Report

  24. [24]

    Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. 2024. Language Modeling Is Compression. InProc. 12th International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=jznbgiynus

  25. [25]

    Martin Dietzfelbinger and Rasmus Pagh. 2008. Succinct Data Structures for Retrieval and Approximate Membership (Extended Abstract). InProc. 35th Inter- national Colloquium on Automata, Languages, and Programming (ICALP). 385–396. https://doi.org/10.1007/978-3-540-70575-8_32

  26. [26]

    Dillinger, Lorenz Hübschle-Schneider, Peter Sanders, and Stefan Walzer

    Peter C. Dillinger, Lorenz Hübschle-Schneider, Peter Sanders, and Stefan Walzer

  27. [27]

    Fast Succinct Retrieval and Approximate Membership Using Ribbon. In Proc. 20th International Symposium on Experimental Algorithms (SEA). 4:1–4:20. https://doi.org/10.4230/LIPICS.SEA.2022.4

  28. [28]

    Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, and Giorgio Vinciguerra

  29. [29]

    CPM.2016.13,doi:10.4230/LIPICS.CPM.2016.13

    Learned monotone minimal perfect hashing. InProceedings of the 31st Annual European Symposium on Algorithms (ESA). https://doi.org/10.4230/LIPIcs. ESA.2023.46

  30. [30]

    Paolo Ferragina, Giovanni Manzini, and Giorgio Vinciguerra. 2022. Compressing and Querying Integer Dictionaries Under Linearities and Repetitions.IEEE Access 10 (2022), 118831–118848. https://doi.org/10.1109/ACCESS.2022.3221520

  31. [31]

    Paolo Ferragina and Giorgio Vinciguerra. 2020. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds.PVLDB13, 8 (2020), 1162–1175. https://doi.org/10.14778/3389133.3389135

  32. [32]

    Fredman, János Komlós, and Endre Szemerédi

    Michael L. Fredman, János Komlós, and Endre Szemerédi. 1984. Storing a Sparse Table with 𝑂( 1) Worst Case Access Time.J. ACM31, 3 (1984), 538–544. https: //doi.org/10.1145/828.1884

  33. [33]

    Gallager

    Robert G. Gallager. 1978. Variations on a theme by Huffman.IEEE Trans. Inf. Theory24, 6 (1978), 668–674. https://doi.org/10.1109/TIT.1978.1055959

  34. [34]

    Parameswaran

    Yihan Gao and Aditya G. Parameswaran. 2016. Squish: Near-Optimal Com- pression for Archival of Relational Datasets. InProc. 22nd ACM International Conference on Knowledge Discovery and Data Mining (KDD). 1575–1584. https: //doi.org/10.1145/2939672.2939867

  35. [35]

    Marco Genuzio, Giuseppe Ottaviano, and Sebastiano Vigna. 2020. Fast scalable construction of ([compressed] static | minimal perfect hash) functions.Inf. Comput.273 (2020), 104517. https://doi.org/10.1016/J.IC.2020.104517

  36. [36]

    Shelton, Andrew S

    Dave Gomboc, Christian R. Shelton, Andrew S. Miner, and Gianfranco Ciardo

  37. [37]

    Comparing Lossless Compression Methods for Chess Endgame Data. In Proc. 27th European Conference on Artificial Intelligence (ECAI), Vol. 392. IOS Press, 4116–4123. https://doi.org/10.3233/FAIA240982

  38. [38]

    Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, and Idoia Ochoa. 2019. DeepZip: Lossless Data Compression Using Recurrent Neural Networks. InProc. 29th Data Compression Conference (DCC). 575. https://doi.org/10.1109/DCC.2019. 00087

  39. [39]

    Thomas Mueller Graf and Daniel Lemire. 2020. Xor Filters.ACM J. Exp. Algo- rithmics25 (2020), 1–16. https://doi.org/10.1145/3376122

  40. [40]

    Thomas Mueller Graf and Daniel Lemire. 2022. Binary Fuse Filters: Fast and Smaller Than Xor Filters.ACM J. Exp. Algorithmics27 (2022), 1.5:1–1.5:15. https://doi.org/10.1145/3510449

  41. [41]

    Andrea Guerra, Giorgio Vinciguerra, Antonio Boffa, and Paolo Ferragina. 2025. Learned Compression of Nonlinear Time Series With Random Access. InProc. 41st IEEE International Conference on Data Engineering (ICDE). 1579–1592. https: //doi.org/10.1109/ICDE65448.2025.00122

  42. [42]

    Torben Hagerup and Torsten Tholey. 2001. Efficient Minimal Perfect Hashing in Nearly Minimal Space. InProc. 18th Annual Symposium on Theoretical As- pects of Computer Science (STACS), Vol. 2010. 317–326. https://doi.org/10.1007/ 3-540-44693-1_28

  43. [43]

    Stefan Hermann. 2025. MorphisHash: Improving Space Efficiency of ShockHash for Minimal Perfect Hashing.arXiv preprint arXiv:2503.10161(2025)

  44. [44]

    Hreinsson, Morten Krøyer, and Rasmus Pagh

    Jóhannes B. Hreinsson, Morten Krøyer, and Rasmus Pagh. 2009. Storing a Compressed Function with Constant Time Access. InProc. 17th Annual Eu- ropean Symposium on Algorithms (ESA). 730–741. https://doi.org/10.1007/ 978-3-642-04128-0_65

  45. [45]

    David A Huffman. 2007. A method for the construction of minimum-redundancy codes.Proceedings of the IRE40, 9 (2007), 1098–1101

  46. [46]

    Amir Ilkhechi, Andrew Crotty, Alex Galakatos, Yicong Mao, Grace Fan, Xiran Shi, and Ugur Cetintemel. 2020. DeepSqueeze: Deep Semantic Compression for Tabular Data. InProc. 46th ACM International Conference on Management of Data (SIGMOD). 1733–1746. https://doi.org/10.1145/3318464.3389734

  47. [47]

    Amitansh Joshi, Amit Parolkar, and Vedant Das. 2023. Spotify_1Million_Tracks. https://doi.org/10.34740/KAGGLE/DSV/5987852

  48. [48]

    Minoru Kanehisa, Miho Furumichi, Yoko Sato, Mari Ishiguro-Watanabe, and Mao Tanabe. 2021. KEGG: integrating viruses and cellular organisms.Nucleic acids research49, D1 (2021), D545–D551

  49. [49]

    Kiss, Éva Hosszu, János Tapolcai, Lajos Rónyai, and Ori Rottenstreich

    Sándor Z. Kiss, Éva Hosszu, János Tapolcai, Lajos Rónyai, and Ori Rottenstreich

  50. [50]

    Bloom Filter With a False Positive Free Zone.IEEE Trans. Netw. Serv. Manag.18, 2 (2021), 2334–2349. https://doi.org/10.1109/TNSM.2021.3059075

  51. [51]

    1998.The art of computer programming: Volume 3: Sorting and Searching

    Donald E Knuth. 1998.The art of computer programming: Volume 3: Sorting and Searching. Addison-Wesley Professional

  52. [52]

    Puglisi, and Rajeev Raman

    Dominik Köppl, Simon J. Puglisi, and Rajeev Raman. 2022. Fast and Simple Compact Hashing via Bucketing.Algorithmica84, 9 (2022), 2735–2766. https: //doi.org/10.1007/S00453-022-00996-Y

  53. [53]

    Chi, Jeffrey Dean, and Neoklis Polyzotis

    Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. InProc. 44th ACM International Conference on Management of Data (SIGMOD). 489–504. https://doi.org/10.1145/3183713. 3196909

  54. [54]

    Oguzhan Külekci

    M. Oguzhan Külekci. 2014. Enhanced Variable-Length Codes: Improved Com- pression with Efficient Random Access. InProc. 24th Data Compression Conference (DCC). 362–371. https://doi.org/10.1109/DCC.2014.74

  55. [55]

    Florian Kurpicz, Hans-Peter Lehmann, and Peter Sanders. 2023. PaCHash: Packed and compressed hash tables. InProc. 25th Symposium on Algorithm Engineering 13 and Experiments (ALENEX). 162–175. https://doi.org/10.1137/1.9781611977561. ch14

  56. [56]

    Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. 2023. SicHash - Small Irregular Cuckoo Tables for Perfect Hashing. InProc. 25th Symposium on Algo- rithm Engineering and Experiments (ALENEX). 176–189. https://doi.org/10.1137/ 1.9781611977561.ch15

  57. [57]

    Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. 2024. ShockHash: To- wards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force. InProc. 26th Symposium on Algorithm Engineering and Experiments (ALENEX). 194–206. https://doi.org/10.1137/1.9781611977929.15

  58. [58]

    Hans-Peter Lehmann, Thomas Mueller, Rasmus Pagh, Giulio Ermanno Pibiri, Peter Sanders, Sebastiano Vigna, and Stefan Walzer. 2025. Modern Minimal Perfect Hashing: A Survey.CoRRabs/2506.06536 (2025). https://doi.org/10. 48550/ARXIV.2506.06536

  59. [59]

    Hanwen Liu, Mihail Stoian, Alexander van Renen, and Andreas Kipf. 2024. Corra: Correlation-Aware Column Compression. InProc. 2nd Workshop on Cloud Databases (CloudDB). https://vldb.org/workshops/2024/proceedings/CloudDB/ clouddb-2.pdf

  60. [60]

    Yihao Liu, Xinyu Zeng, and Huanchen Zhang. 2024. LeCo: Lightweight Com- pression via Learning Serial Correlations.Proc. ACM Manag. Data2, 1, Article 65 (mar 2024), 28 pages. https://doi.org/10.1145/3639320

  61. [61]

    Dario Malchiodi, Davide Raimondi, Giacomo Fumagalli, Raffaele Giancarlo, and Marco Frasca. 2024. The role of classifiers and data complexity in learned Bloom filters: insights and recommendations.J. Big Data11, 1 (2024), 45. https: //doi.org/10.1186/S40537-024-00906-9

  62. [62]

    Michael Mitzenmacher. 2018. A Model for Learned Bloom Filters and Optimizing by Sandwiching. InProc. 31st Annual Conference on Neural Information Processing Systems (NeurIPS). 462–471. https://proceedings.neurips.cc/paper/2018/hash/ 0f49c89d1e7298bb9930789c8ed59d48-Abstract.html

  63. [63]

    Christian Worm Mortensen, Rasmus Pagh, and Mihai Pˇatraçcu. 2005. On dynamic range reporting in one dimension. InProc. 37th annual ACM Symposium on Theory of Computing (STOC). 104–111

  64. [64]

    Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. 2015. Obtain- ing well calibrated probabilities using bayesian binning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 29

  65. [65]

    2016.Compact data structures: a practical approach

    Gonzalo Navarro. 2016.Compact data structures: a practical approach. Cambridge University Press

  66. [66]

    Rasmus Pagh and Flemming Friche Rodler. 2004. Cuckoo hashing.J. Algorithms 51, 2 (2004), 122–144. https://doi.org/10.1016/j.jalgor.2003.12.002

  67. [67]

    Ripon Patgiri, Anupam Biswas, and Sabuzima Nayak. 2023. deepBF: Malicious URL detection using learned Bloom Filter and evolutionary deep learning.Com- put. Commun.200 (2023), 30–41. https://doi.org/10.1016/J.COMCOM.2022.12.027

  68. [68]

    Mihai Pătraşcu. 2008. Succincter. InProc. 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS). 305–313. https://doi.org/10.1109/FOCS. 2008.83

  69. [69]

    Ely Porat. 2009. An Optimal Bloom Filter Replacement Based on Matrix Solving. InProc. 4th International Computer Science Symposium in Russia (CSR). 263–273. https://doi.org/10.1007/978-3-642-03351-3_25

  70. [70]

    Arvind Prasad and Shalini Chandra. 2024. PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning.Comput. Secur.136 (2024), 103545. https://doi.org/10.1016/ J.COSE.2023.103545 Dataset available at https://archive.ics.uci.edu/dataset/967/ phiusiil+phishing+url+dataset

  71. [71]

    Rae, Sergey Bartunov, and Timothy P

    Jack W. Rae, Sergey Bartunov, and Timothy P. Lillicrap. 2019. Meta-Learning Neural Bloom Filters. InProc. 36th International Conference on Machine Learning (ICML). 5271–5280. http://proceedings.mlr.press/v97/rae19a.html

  72. [72]

    Julian Reichinger, Thomas Krismayer, and Jan Rellermeyer. 2024. COPR–Efficient, large-scale log storage and retrieval.arXiv preprint arXiv:2402.18355(2024)

  73. [73]

    Pedro Reviriego, José Alberto Hernández, Zhenwei Dai, and Anshumali Shrivas- tava. 2021. Learned Bloom Filters in Adversarial Environments: A Malicious URL Detection Use-Case. InProc. 22nd IEEE International Conference on High Perfor- mance Switching and Routing (HPSR). 1–6. https://doi.org/10.1109/HPSR52026. 2021.9481857

  74. [74]

    Ibrahim Sabek, Kapil Vaidya, Dominik Horn, Andreas Kipf, Michael Mitzen- macher, and Tim Kraska. 2022. Can Learned Models Replace Hash Functions? PVLDB16, 3 (2022), 532–545. https://doi.org/10.14778/3570690.3570702

  75. [75]

    Mohanad Sarhan, Siamak Layeghy, Nour Moustafa, and Marius Portmann. 2020. NetFlow Datasets for Machine Learning-Based Network Intrusion Detection Sys- tems. InProc. 10th EAI International Conference on Big Data Technologies and Appli- cations (LNICST), Vol. 371. 117–135. https://doi.org/10.1007/978-3-030-72802-1_9 Dataset available at https://staff.itee.u...

  76. [76]

    Atsuki Sato and Yusuke Matsui. 2023. Fast Partitioned Learned Bloom Filter. InProc. 36th Annual Conference on Neural Information Process- ing Systems (NeurIPS). http://papers.nips.cc/paper_files/paper/2023/hash/ 7b2e844c52349134268e819a9b56b9e8-Abstract-Conference.html

  77. [77]

    Jürgen Schmidhuber and Stefan Heil. 1996. Sequential neural text compression. IEEE Trans. Neural Networks7, 1 (1996), 142–146. https://doi.org/10.1109/72. 478398

  78. [78]

    Claude E Shannon. 1948. A mathematical theory of communication.The Bell system technical journal27, 3 (1948), 379–423

  79. [79]

    Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. 2022. Space- efficient representation of genomic k-mer count tables.Algorithms Mol. Biol.17, 1 (2022), 5. https://doi.org/10.1186/S13015-022-00212-0

  80. [80]

    Samuel D. Stearns. 1995. Arithmetic coding in lossless waveform compression. IEEE Trans. Signal Process.43, 8 (1995), 1874–1879. https://doi.org/10.1109/78. 403346

Showing first 80 references.