Learned Static Function Data Structures

Giorgio Vinciguerra; Hans-Peter Lehmann; Stefan Hermann; Stefan Walzer

arxiv: 2510.27588 · v3 · pith:APVXVHWLnew · submitted 2025-10-31 · 💻 cs.DS · cs.DB· cs.LG

Learned Static Function Data Structures

Stefan Hermann , Hans-Peter Lehmann , Giorgio Vinciguerra , Stefan Walzer This is my paper

Pith reviewed 2026-05-21 19:53 UTC · model grok-4.3

classification 💻 cs.DS cs.DBcs.LG

keywords static functionslearned data structuresprefix codeszero-order entropymachine learningdata compressionpoint querieskey-value associations

0 comments

The pith

Learned static functions use machine learning to predict per-key value probabilities and build shorter prefix codes, breaking the zero-order entropy barrier for point queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces learned static functions for associating a fixed set of keys with values without storing the keys themselves. Traditional static functions are limited by the zero-order entropy of the value sequence, but here a machine learning model first predicts a probability distribution over possible values for each key. From that distribution the method derives a custom prefix code tailored to the key, then stores the resulting short codeword inside an ordinary static function. If the model finds real correlations between keys and values, the average code length falls below what any single global code could achieve. Experiments report space reductions of up to an order of magnitude on real data sets and three orders of magnitude on synthetic data, all while still answering point queries in constant time.

Core claim

The central claim is that a machine-learning model can capture non-trivial correlations between keys and values, allowing the construction of a key-specific prefix code whose expected length is shorter than the zero-order entropy of the value sequence; these compact codewords are then stored in a conventional static function data structure that supports point queries without materializing the key set.

What carries the argument

Per-key prefix code derived from a machine-learning model's predicted probability distribution over values, stored inside a classic static function.

If this is right

Space usage can fall below the zero-order entropy bound while point queries remain supported.
Savings scale with the strength of correlations the model can learn.
The approach works for any static function that can store the derived codewords.
Real-world data sets with natural key-value dependencies yield up to 10x compression; synthetic data with engineered correlations yield up to 1000x compression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prediction-plus-prefix-code idea could be layered on top of existing compressed static functions to obtain further gains.
If the model is cheap to evaluate, the method might extend to semi-static or slowly changing data sets without rebuilding from scratch.
Applications that already use static functions for dictionaries or Bloom-filter alternatives could adopt this technique to reduce memory footprint.

Load-bearing premise

The machine learning model must capture non-trivial correlations between keys and values so that the per-key prefix codes are shorter on average than any single global code.

What would settle it

Run the construction on a data set in which keys and values are generated independently; if measured space usage remains strictly above the zero-order entropy of the values, the claimed improvement does not occur.

Figures

Figures reproduced from arXiv: 2510.27588 by Giorgio Vinciguerra, Hans-Peter Lehmann, Stefan Hermann, Stefan Walzer.

**Figure 2.** Figure 2: Structure of the BuRR equation system [26]. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: A VL-BuRR data structure using ℓmax separate 1-bit SFs, given by column vectors 𝒁0, . . . , 𝒁ℓmax−1 (here ℓmax = 6). Shaded areas are two examples of the bits accessed by a single query. Grey dots indicate individual bits, dotted lines indicate how the bits are grouped into words of length𝒘, and numbers indicate the order in which these words are stored. ℓmax instances of 1-bit BuRR, where ℓmax is the leng… view at source ↗

**Figure 4.** Figure 4: Idealised space overhead of filter-based WRM for [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Query time, inference time (dashed) and space over [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

read the original abstract

We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence. In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gets static functions below zero-order entropy by deriving per-key prefix codes from an ML-predicted distribution and storing them in a standard static function.

read the letter

The main thing to know is that the construction uses a model to predict a value distribution for each key, builds a short prefix code from that prediction, and packs the actual codeword into an ordinary static function. Total space is model size plus the sum of those code lengths. When the model finds real key-value dependence, average code length drops below the global entropy of the value sequence, which is the source of the reported gains.

Referee Report

2 major / 2 minor

Summary. The paper introduces learned static functions for static key-value associations that support point queries. An ML model predicts a per-key distribution over values; a key-specific prefix code is derived from this distribution and the codeword is stored in a standard static-function data structure. The construction is claimed to break the zero-order entropy barrier of the value sequence when the model captures non-trivial key-value correlations. Experiments report concrete space savings of up to 10× on real data and up to three orders of magnitude on synthetic data.

Significance. If the central claim holds, the result is significant for the data-structures community: it supplies a concrete, implementable method that integrates standard ML training with existing static-function primitives to obtain compression strictly below H0 while preserving O(1) queries. The approach is not circular (savings are measured against external baselines) and the reported ratios on both real and synthetic instances provide direct, falsifiable evidence that the inequality can be realized in practice.

major comments (2)

[Experiments] Experiments section: the total-space figures must explicitly include the size of the trained ML model (plus any auxiliary structures) in addition to the static-function storage of the codewords; without this accounting the claim that the construction breaks the zero-order entropy barrier cannot be verified.
[§4] §4 (or equivalent experimental subsection): the paper should state the experimental protocol for instance selection and report whether the quoted savings ratios come from a pre-specified set of runs or were obtained after observing the data; inclusion of standard error bars or multiple independent trials would directly address this.

minor comments (2)

[Abstract] Abstract: the phrases 'up to one order of magnitude' and 'up to three orders of magnitude' would be clearer if the paper distinguished maximum observed savings from average savings across the test suite.
[Construction] Notation: the manuscript should adopt a uniform symbol for the learned distribution (e.g., p̂_v|k) and for the resulting code length to avoid ambiguity when comparing against H0.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the experimental reporting. We address each major comment below and will incorporate the requested clarifications in a revised version of the manuscript.

read point-by-point responses

Referee: [Experiments] Experiments section: the total-space figures must explicitly include the size of the trained ML model (plus any auxiliary structures) in addition to the static-function storage of the codewords; without this accounting the claim that the construction breaks the zero-order entropy barrier cannot be verified.

Authors: We agree that total-space figures must account for the trained ML model and auxiliary structures to allow verification of the entropy-barrier claim. In the revised manuscript we will add explicit reporting of model sizes (and any auxiliary data) alongside the static-function storage costs, and we will recompute and present the space savings ratios using these complete figures. revision: yes
Referee: [§4] §4 (or equivalent experimental subsection): the paper should state the experimental protocol for instance selection and report whether the quoted savings ratios come from a pre-specified set of runs or were obtained after observing the data; inclusion of standard error bars or multiple independent trials would directly address this.

Authors: We will expand §4 to describe the instance-selection protocol in detail and to state whether the reported ratios derive from a pre-specified set of runs. Where feasible we will also add results from multiple independent trials together with standard error bars. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's construction trains an ML model on the key-value pairs to output per-key value distributions, derives Huffman-style prefix codes from those distributions, and stores the resulting codewords inside an off-the-shelf static-function data structure whose space is bounded by the sum of the code lengths plus the model size. The claimed improvement below zero-order entropy H0 follows directly from the information-theoretic property that a code whose lengths are -log p_i (where p_i is the model-predicted probability) is shorter on average precisely when the model assigns higher probability to the true value than the uniform or global distribution; this inequality is measured on held-out or synthetic instances against external baselines and does not reduce to a fitted parameter being renamed as a prediction. No load-bearing step invokes a self-citation whose content is itself unverified, nor does any equation equate a derived quantity to its own input by definition. The experiments therefore constitute independent evidence rather than a tautology.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence of learnable key-value correlations and on the correctness of the underlying static function primitive; no new mathematical axioms are introduced.

free parameters (1)

ML model architecture and hyperparameters
Choice of model family and training settings are selected to fit the observed key-value correlations.

pith-pipeline@v0.9.0 · 5698 in / 1047 out tokens · 99866 ms · 2026-05-21T19:53:25.129141+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages

[1]

The Gene Ontology resource: enriching a GOld mine.Nucleic acids research 49, D1 (2021), D325–D334

2021. The Gene Ontology resource: enriching a GOld mine.Nucleic acids research 49, D1 (2021), D325–D334

work page 2021
[2]

Stephen Alstrup, Gerth Brodal, and Theis Rauhe. 2001. Optimal static range reporting in one dimension. InProc. 33rd Annual ACM Symposium on Theory of Computing (STOC). 476–482

work page 2001
[3]

Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and Sheng Lin

Naiyong Ao, Fan Zhang, Di Wu, Douglas S. Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and Sheng Lin. 2011. Efficient Parallel Lists Intersection and Index Compression Algorithms Using Graphics Processing Units.PVLDB4, 8 (2011), 470–481. https://doi.org/10.14778/2002974.2002975 12

work page doi:10.14778/2002974.2002975 2011
[4]

Sepehr Assadi, Martin Farach-Colton, and William Kuszmaul. 2023. Tight Bounds for Monotone Minimal Perfect Hashing. InProc. 34th Annual ACM- SIAM Symposium on Discrete Algorithms (SODA). 456–476. https://doi.org/10. 1137/1.9781611977554.CH20

work page 2023
[5]

Garofalakis, and Rajeev Rastogi

Shivnath Babu, Minos N. Garofalakis, and Rajeev Rastogi. 2001. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. InProc. 28th ACM International Conference on Management of Data (SIGMOD). 283–294. https://doi.org/10.1145/375663.375693

work page doi:10.1145/375663.375693 2001
[6]

Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. 2010. Fast prefix search in little space, with applications. InProc. 18th Annual European Symposium on Algorithms (ESA). 427–438

work page 2010
[7]

Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. 2011. The- ory and practice of monotone minimal perfect hashing.ACM J. Exp. Algorithmics 16 (2011). https://doi.org/10.1145/1963190.2025378

work page doi:10.1145/1963190.2025378 2011
[8]

Djamal Belazzougui and Rossano Venturini. 2013. Compressed static functions with applications. InProc. 24th Annual ACM-SIAM Symposium on Discrete Algo- rithms (SODA). SIAM, 229–240. https://doi.org/10.1137/1.9781611973105.17

work page doi:10.1137/1.9781611973105.17 2013
[9]

Bender, Alex Conway, Martin Farach-Colton, William Kuszmaul, and Guido Tagliavini

Michael A. Bender, Alex Conway, Martin Farach-Colton, William Kuszmaul, and Guido Tagliavini. 2023. Iceberg Hashing: Optimizing Many Hash-Table Criteria at Once.J. ACM70, 6 (2023), 40:1–40:51. https://doi.org/10.1145/3625817

work page doi:10.1145/3625817 2023
[10]

Bercea, Jakob Bæk Tejs Houen, and Rasmus Pagh

Ioana O. Bercea, Jakob Bæk Tejs Houen, and Rasmus Pagh. 2024. Daisy Bloom Filters. InProc. 19th Scandinavian Symposium and Workshops on Algorithm Theory (SW AT). 9:1–9:19. https://doi.org/10.4230/LIPICS.SWAT.2024.9

work page doi:10.4230/lipics.swat.2024.9 2024
[11]

Jock Blackard. 1998. Covertype. UCI Machine Learning Repository. https: //doi.org/10.24432/C50K5N

work page doi:10.24432/c50k5n 1998
[12]

Burton H. Bloom. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors.Commun. ACM13, 7 (1970), 422–426. https://doi.org/10.1145/362686. 362692

work page doi:10.1145/362686 1970
[13]

Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. 2022. A Learned Approach to Design Compressed Rank/Select Data Structures.ACM Trans. on Algorithms18, 3, Article 24 (oct 2022), 28 pages. https://doi.org/10.1145/3524060

work page doi:10.1145/3524060 2022
[14]

Botelho, Yoshiharu Kohayakawa, and Nivio Ziviani

Fabiano C. Botelho, Yoshiharu Kohayakawa, and Nivio Ziviani. 2005. A Prac- tical Minimal Perfect Hashing Method. InProc. 4th International Workshop on Experimental and Efficient Algorithms (WEA). 488–500. https://doi.org/10.1007/ 11427186_42

work page 2005
[15]

Botelho, Rasmus Pagh, and Nivio Ziviani

Fabiano C. Botelho, Rasmus Pagh, and Nivio Ziviani. 2013. Practical Perfect Hashing in Nearly Optimal Space.Inf. Syst.38, 1 (2013), 108–131. https://doi. org/10.1016/j.is.2012.06.002

work page doi:10.1016/j.is.2012.06.002 2013
[16]

Jehoshua Bruck, Jie Gao, and Anxiao Jiang. 2006. Weighted Bloom filter. In Proc. 39th IEEE International Symposium on Information Theory (ISIT). 2304–2308. https://doi.org/10.1109/ISIT.2006.261978

work page doi:10.1109/isit.2006.261978 2006
[17]

Hayoung Byun and Hyesook Lim. 2022. Learned FBF: Learning-Based Functional Bloom Filter for Key-Value Storage.IEEE Trans. Computers71, 8 (2022), 1928–1938. https://doi.org/10.1109/TC.2021.3112079

work page doi:10.1109/tc.2021.3112079 2022
[18]

Benjamin Coleman, David Torres Ramos, Vihan Lakshman, Chen Luo, and An- shumali Shrivastava. 2023. CARAMEL: A Succinct Read-Only Lookup Table via Compressed Static Functions. (2023). https://doi.org/10.48550/arXiv.2305.16545

work page doi:10.48550/arxiv.2305.16545 2023
[19]

Colin Cooper. 2000. On the rank of random matrices.Random Struct. Algorithms 16, 2 (2000), 209–232. https://doi.org/10.1002/(SICI)1098-2418(200003)16:2% 3C209::AID-RSA6%3E3.0.CO;2-1

work page doi:10.1002/(sici)1098-2418(200003)16:2 2000
[20]

Zhenwei Dai and Anshumali Shrivastava. 2020. Adaptive Learned Bloom Filter (Ada-BF): Efficient Utilization of the Classifier with Application to Real-Time Information Filtering on the Web. InProc. 33rd Annual Conference on Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/ 2020/hash/86b94dae7c6517ec1ac767fd2c136580-Abst...

work page 2020
[21]

Zhenwei Dai, Anshumali Shrivastava, Pedro Reviriego, and José Alberto Hernán- dez. 2022. Optimizing Learned Bloom Filters: How Much Should Be Learned? IEEE Embed. Syst. Lett.14, 3 (2022), 123–126. https://doi.org/10.1109/LES.2022. 3156019

work page doi:10.1109/les.2022 2022
[22]

Scott Davies and Andrew W. Moore. 1999. Bayesian Networks for Lossless Dataset Compression. InProc. 5th ACM International Conference on Knowledge Discovery and Data Mining, (KDD). 387–391. https://doi.org/10.1145/312129. 312289

work page doi:10.1145/312129 1999
[23]

1981.Assessing Probability Assessors: Calibration and Refinement.Technical Report

Morris H DeGroot and Stephen E Fienberg. 1981.Assessing Probability Assessors: Calibration and Refinement.Technical Report

work page 1981
[24]

Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. 2024. Language Modeling Is Compression. InProc. 12th International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=jznbgiynus

work page 2024
[25]

Martin Dietzfelbinger and Rasmus Pagh. 2008. Succinct Data Structures for Retrieval and Approximate Membership (Extended Abstract). InProc. 35th Inter- national Colloquium on Automata, Languages, and Programming (ICALP). 385–396. https://doi.org/10.1007/978-3-540-70575-8_32

work page doi:10.1007/978-3-540-70575-8_32 2008
[26]

Dillinger, Lorenz Hübschle-Schneider, Peter Sanders, and Stefan Walzer

Peter C. Dillinger, Lorenz Hübschle-Schneider, Peter Sanders, and Stefan Walzer

work page
[27]

Fast Succinct Retrieval and Approximate Membership Using Ribbon. In Proc. 20th International Symposium on Experimental Algorithms (SEA). 4:1–4:20. https://doi.org/10.4230/LIPICS.SEA.2022.4

work page doi:10.4230/lipics.sea.2022.4 2022
[28]

Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, and Giorgio Vinciguerra

work page
[29]

CPM.2016.13,doi:10.4230/LIPICS.CPM.2016.13

Learned monotone minimal perfect hashing. InProceedings of the 31st Annual European Symposium on Algorithms (ESA). https://doi.org/10.4230/LIPIcs. ESA.2023.46

work page doi:10.4230/lipics 2023
[30]

Paolo Ferragina, Giovanni Manzini, and Giorgio Vinciguerra. 2022. Compressing and Querying Integer Dictionaries Under Linearities and Repetitions.IEEE Access 10 (2022), 118831–118848. https://doi.org/10.1109/ACCESS.2022.3221520

work page doi:10.1109/access.2022.3221520 2022
[31]

Paolo Ferragina and Giorgio Vinciguerra. 2020. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds.PVLDB13, 8 (2020), 1162–1175. https://doi.org/10.14778/3389133.3389135

work page doi:10.14778/3389133.3389135 2020
[32]

Fredman, János Komlós, and Endre Szemerédi

Michael L. Fredman, János Komlós, and Endre Szemerédi. 1984. Storing a Sparse Table with 𝑂( 1) Worst Case Access Time.J. ACM31, 3 (1984), 538–544. https: //doi.org/10.1145/828.1884

work page doi:10.1145/828.1884 1984
[33]

Gallager

Robert G. Gallager. 1978. Variations on a theme by Huffman.IEEE Trans. Inf. Theory24, 6 (1978), 668–674. https://doi.org/10.1109/TIT.1978.1055959

work page doi:10.1109/tit.1978.1055959 1978
[34]

Parameswaran

Yihan Gao and Aditya G. Parameswaran. 2016. Squish: Near-Optimal Com- pression for Archival of Relational Datasets. InProc. 22nd ACM International Conference on Knowledge Discovery and Data Mining (KDD). 1575–1584. https: //doi.org/10.1145/2939672.2939867

work page doi:10.1145/2939672.2939867 2016
[35]

Marco Genuzio, Giuseppe Ottaviano, and Sebastiano Vigna. 2020. Fast scalable construction of ([compressed] static | minimal perfect hash) functions.Inf. Comput.273 (2020), 104517. https://doi.org/10.1016/J.IC.2020.104517

work page doi:10.1016/j.ic.2020.104517 2020
[36]

Shelton, Andrew S

Dave Gomboc, Christian R. Shelton, Andrew S. Miner, and Gianfranco Ciardo

work page
[37]

Comparing Lossless Compression Methods for Chess Endgame Data. In Proc. 27th European Conference on Artificial Intelligence (ECAI), Vol. 392. IOS Press, 4116–4123. https://doi.org/10.3233/FAIA240982

work page doi:10.3233/faia240982
[38]

Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, and Idoia Ochoa. 2019. DeepZip: Lossless Data Compression Using Recurrent Neural Networks. InProc. 29th Data Compression Conference (DCC). 575. https://doi.org/10.1109/DCC.2019. 00087

work page doi:10.1109/dcc.2019 2019
[39]

Thomas Mueller Graf and Daniel Lemire. 2020. Xor Filters.ACM J. Exp. Algo- rithmics25 (2020), 1–16. https://doi.org/10.1145/3376122

work page doi:10.1145/3376122 2020
[40]

Thomas Mueller Graf and Daniel Lemire. 2022. Binary Fuse Filters: Fast and Smaller Than Xor Filters.ACM J. Exp. Algorithmics27 (2022), 1.5:1–1.5:15. https://doi.org/10.1145/3510449

work page doi:10.1145/3510449 2022
[41]

Andrea Guerra, Giorgio Vinciguerra, Antonio Boffa, and Paolo Ferragina. 2025. Learned Compression of Nonlinear Time Series With Random Access. InProc. 41st IEEE International Conference on Data Engineering (ICDE). 1579–1592. https: //doi.org/10.1109/ICDE65448.2025.00122

work page doi:10.1109/icde65448.2025.00122 2025
[42]

Torben Hagerup and Torsten Tholey. 2001. Efficient Minimal Perfect Hashing in Nearly Minimal Space. InProc. 18th Annual Symposium on Theoretical As- pects of Computer Science (STACS), Vol. 2010. 317–326. https://doi.org/10.1007/ 3-540-44693-1_28

work page 2001
[43]

Stefan Hermann. 2025. MorphisHash: Improving Space Efficiency of ShockHash for Minimal Perfect Hashing.arXiv preprint arXiv:2503.10161(2025)

work page arXiv 2025
[44]

Hreinsson, Morten Krøyer, and Rasmus Pagh

Jóhannes B. Hreinsson, Morten Krøyer, and Rasmus Pagh. 2009. Storing a Compressed Function with Constant Time Access. InProc. 17th Annual Eu- ropean Symposium on Algorithms (ESA). 730–741. https://doi.org/10.1007/ 978-3-642-04128-0_65

work page 2009
[45]

David A Huffman. 2007. A method for the construction of minimum-redundancy codes.Proceedings of the IRE40, 9 (2007), 1098–1101

work page 2007
[46]

Amir Ilkhechi, Andrew Crotty, Alex Galakatos, Yicong Mao, Grace Fan, Xiran Shi, and Ugur Cetintemel. 2020. DeepSqueeze: Deep Semantic Compression for Tabular Data. InProc. 46th ACM International Conference on Management of Data (SIGMOD). 1733–1746. https://doi.org/10.1145/3318464.3389734

work page doi:10.1145/3318464.3389734 2020
[47]

Amitansh Joshi, Amit Parolkar, and Vedant Das. 2023. Spotify_1Million_Tracks. https://doi.org/10.34740/KAGGLE/DSV/5987852

work page doi:10.34740/kaggle/dsv/5987852 2023
[48]

Minoru Kanehisa, Miho Furumichi, Yoko Sato, Mari Ishiguro-Watanabe, and Mao Tanabe. 2021. KEGG: integrating viruses and cellular organisms.Nucleic acids research49, D1 (2021), D545–D551

work page 2021
[49]

Kiss, Éva Hosszu, János Tapolcai, Lajos Rónyai, and Ori Rottenstreich

Sándor Z. Kiss, Éva Hosszu, János Tapolcai, Lajos Rónyai, and Ori Rottenstreich

work page
[50]

Bloom Filter With a False Positive Free Zone.IEEE Trans. Netw. Serv. Manag.18, 2 (2021), 2334–2349. https://doi.org/10.1109/TNSM.2021.3059075

work page doi:10.1109/tnsm.2021.3059075 2021
[51]

1998.The art of computer programming: Volume 3: Sorting and Searching

Donald E Knuth. 1998.The art of computer programming: Volume 3: Sorting and Searching. Addison-Wesley Professional

work page 1998
[52]

Puglisi, and Rajeev Raman

Dominik Köppl, Simon J. Puglisi, and Rajeev Raman. 2022. Fast and Simple Compact Hashing via Bucketing.Algorithmica84, 9 (2022), 2735–2766. https: //doi.org/10.1007/S00453-022-00996-Y

work page doi:10.1007/s00453-022-00996-y 2022
[53]

Chi, Jeffrey Dean, and Neoklis Polyzotis

Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. InProc. 44th ACM International Conference on Management of Data (SIGMOD). 489–504. https://doi.org/10.1145/3183713. 3196909

work page doi:10.1145/3183713 2018
[54]

Oguzhan Külekci

M. Oguzhan Külekci. 2014. Enhanced Variable-Length Codes: Improved Com- pression with Efficient Random Access. InProc. 24th Data Compression Conference (DCC). 362–371. https://doi.org/10.1109/DCC.2014.74

work page doi:10.1109/dcc.2014.74 2014
[55]

Florian Kurpicz, Hans-Peter Lehmann, and Peter Sanders. 2023. PaCHash: Packed and compressed hash tables. InProc. 25th Symposium on Algorithm Engineering 13 and Experiments (ALENEX). 162–175. https://doi.org/10.1137/1.9781611977561. ch14

work page doi:10.1137/1.9781611977561 2023
[56]

Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. 2023. SicHash - Small Irregular Cuckoo Tables for Perfect Hashing. InProc. 25th Symposium on Algo- rithm Engineering and Experiments (ALENEX). 176–189. https://doi.org/10.1137/ 1.9781611977561.ch15

work page 2023
[57]

Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. 2024. ShockHash: To- wards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force. InProc. 26th Symposium on Algorithm Engineering and Experiments (ALENEX). 194–206. https://doi.org/10.1137/1.9781611977929.15

work page doi:10.1137/1.9781611977929.15 2024
[58]

Hans-Peter Lehmann, Thomas Mueller, Rasmus Pagh, Giulio Ermanno Pibiri, Peter Sanders, Sebastiano Vigna, and Stefan Walzer. 2025. Modern Minimal Perfect Hashing: A Survey.CoRRabs/2506.06536 (2025). https://doi.org/10. 48550/ARXIV.2506.06536

work page arXiv 2025
[59]

Hanwen Liu, Mihail Stoian, Alexander van Renen, and Andreas Kipf. 2024. Corra: Correlation-Aware Column Compression. InProc. 2nd Workshop on Cloud Databases (CloudDB). https://vldb.org/workshops/2024/proceedings/CloudDB/ clouddb-2.pdf

work page 2024
[60]

Yihao Liu, Xinyu Zeng, and Huanchen Zhang. 2024. LeCo: Lightweight Com- pression via Learning Serial Correlations.Proc. ACM Manag. Data2, 1, Article 65 (mar 2024), 28 pages. https://doi.org/10.1145/3639320

work page doi:10.1145/3639320 2024
[61]

Dario Malchiodi, Davide Raimondi, Giacomo Fumagalli, Raffaele Giancarlo, and Marco Frasca. 2024. The role of classifiers and data complexity in learned Bloom filters: insights and recommendations.J. Big Data11, 1 (2024), 45. https: //doi.org/10.1186/S40537-024-00906-9

work page doi:10.1186/s40537-024-00906-9 2024
[62]

Michael Mitzenmacher. 2018. A Model for Learned Bloom Filters and Optimizing by Sandwiching. InProc. 31st Annual Conference on Neural Information Processing Systems (NeurIPS). 462–471. https://proceedings.neurips.cc/paper/2018/hash/ 0f49c89d1e7298bb9930789c8ed59d48-Abstract.html

work page 2018
[63]

Christian Worm Mortensen, Rasmus Pagh, and Mihai Pˇatraçcu. 2005. On dynamic range reporting in one dimension. InProc. 37th annual ACM Symposium on Theory of Computing (STOC). 104–111

work page 2005
[64]

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. 2015. Obtain- ing well calibrated probabilities using bayesian binning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 29

work page 2015
[65]

2016.Compact data structures: a practical approach

Gonzalo Navarro. 2016.Compact data structures: a practical approach. Cambridge University Press

work page 2016
[66]

Rasmus Pagh and Flemming Friche Rodler. 2004. Cuckoo hashing.J. Algorithms 51, 2 (2004), 122–144. https://doi.org/10.1016/j.jalgor.2003.12.002

work page doi:10.1016/j.jalgor.2003.12.002 2004
[67]

Ripon Patgiri, Anupam Biswas, and Sabuzima Nayak. 2023. deepBF: Malicious URL detection using learned Bloom Filter and evolutionary deep learning.Com- put. Commun.200 (2023), 30–41. https://doi.org/10.1016/J.COMCOM.2022.12.027

work page doi:10.1016/j.comcom.2022.12.027 2023
[68]

Mihai Pătraşcu. 2008. Succincter. InProc. 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS). 305–313. https://doi.org/10.1109/FOCS. 2008.83

work page doi:10.1109/focs 2008
[69]

Ely Porat. 2009. An Optimal Bloom Filter Replacement Based on Matrix Solving. InProc. 4th International Computer Science Symposium in Russia (CSR). 263–273. https://doi.org/10.1007/978-3-642-03351-3_25

work page doi:10.1007/978-3-642-03351-3_25 2009
[70]

Arvind Prasad and Shalini Chandra. 2024. PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning.Comput. Secur.136 (2024), 103545. https://doi.org/10.1016/ J.COSE.2023.103545 Dataset available at https://archive.ics.uci.edu/dataset/967/ phiusiil+phishing+url+dataset

work page arXiv 2024
[71]

Rae, Sergey Bartunov, and Timothy P

Jack W. Rae, Sergey Bartunov, and Timothy P. Lillicrap. 2019. Meta-Learning Neural Bloom Filters. InProc. 36th International Conference on Machine Learning (ICML). 5271–5280. http://proceedings.mlr.press/v97/rae19a.html

work page 2019
[72]

Julian Reichinger, Thomas Krismayer, and Jan Rellermeyer. 2024. COPR–Efficient, large-scale log storage and retrieval.arXiv preprint arXiv:2402.18355(2024)

work page arXiv 2024
[73]

Pedro Reviriego, José Alberto Hernández, Zhenwei Dai, and Anshumali Shrivas- tava. 2021. Learned Bloom Filters in Adversarial Environments: A Malicious URL Detection Use-Case. InProc. 22nd IEEE International Conference on High Perfor- mance Switching and Routing (HPSR). 1–6. https://doi.org/10.1109/HPSR52026. 2021.9481857

work page doi:10.1109/hpsr52026 2021
[74]

Ibrahim Sabek, Kapil Vaidya, Dominik Horn, Andreas Kipf, Michael Mitzen- macher, and Tim Kraska. 2022. Can Learned Models Replace Hash Functions? PVLDB16, 3 (2022), 532–545. https://doi.org/10.14778/3570690.3570702

work page doi:10.14778/3570690.3570702 2022
[75]

Mohanad Sarhan, Siamak Layeghy, Nour Moustafa, and Marius Portmann. 2020. NetFlow Datasets for Machine Learning-Based Network Intrusion Detection Sys- tems. InProc. 10th EAI International Conference on Big Data Technologies and Appli- cations (LNICST), Vol. 371. 117–135. https://doi.org/10.1007/978-3-030-72802-1_9 Dataset available at https://staff.itee.u...

work page doi:10.1007/978-3-030-72802-1_9 2020
[76]

Atsuki Sato and Yusuke Matsui. 2023. Fast Partitioned Learned Bloom Filter. InProc. 36th Annual Conference on Neural Information Process- ing Systems (NeurIPS). http://papers.nips.cc/paper_files/paper/2023/hash/ 7b2e844c52349134268e819a9b56b9e8-Abstract-Conference.html

work page 2023
[77]

Jürgen Schmidhuber and Stefan Heil. 1996. Sequential neural text compression. IEEE Trans. Neural Networks7, 1 (1996), 142–146. https://doi.org/10.1109/72. 478398

work page doi:10.1109/72 1996
[78]

Claude E Shannon. 1948. A mathematical theory of communication.The Bell system technical journal27, 3 (1948), 379–423

work page 1948
[79]

Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. 2022. Space- efficient representation of genomic k-mer count tables.Algorithms Mol. Biol.17, 1 (2022), 5. https://doi.org/10.1186/S13015-022-00212-0

work page doi:10.1186/s13015-022-00212-0 2022
[80]

Samuel D. Stearns. 1995. Arithmetic coding in lossless waveform compression. IEEE Trans. Signal Process.43, 8 (1995), 1874–1879. https://doi.org/10.1109/78. 403346

work page doi:10.1109/78 1995

Showing first 80 references.

[1] [1]

The Gene Ontology resource: enriching a GOld mine.Nucleic acids research 49, D1 (2021), D325–D334

2021. The Gene Ontology resource: enriching a GOld mine.Nucleic acids research 49, D1 (2021), D325–D334

work page 2021

[2] [2]

Stephen Alstrup, Gerth Brodal, and Theis Rauhe. 2001. Optimal static range reporting in one dimension. InProc. 33rd Annual ACM Symposium on Theory of Computing (STOC). 476–482

work page 2001

[3] [3]

Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and Sheng Lin

Naiyong Ao, Fan Zhang, Di Wu, Douglas S. Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and Sheng Lin. 2011. Efficient Parallel Lists Intersection and Index Compression Algorithms Using Graphics Processing Units.PVLDB4, 8 (2011), 470–481. https://doi.org/10.14778/2002974.2002975 12

work page doi:10.14778/2002974.2002975 2011

[4] [4]

Sepehr Assadi, Martin Farach-Colton, and William Kuszmaul. 2023. Tight Bounds for Monotone Minimal Perfect Hashing. InProc. 34th Annual ACM- SIAM Symposium on Discrete Algorithms (SODA). 456–476. https://doi.org/10. 1137/1.9781611977554.CH20

work page 2023

[5] [5]

Garofalakis, and Rajeev Rastogi

Shivnath Babu, Minos N. Garofalakis, and Rajeev Rastogi. 2001. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. InProc. 28th ACM International Conference on Management of Data (SIGMOD). 283–294. https://doi.org/10.1145/375663.375693

work page doi:10.1145/375663.375693 2001

[6] [6]

Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. 2010. Fast prefix search in little space, with applications. InProc. 18th Annual European Symposium on Algorithms (ESA). 427–438

work page 2010

[7] [7]

Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. 2011. The- ory and practice of monotone minimal perfect hashing.ACM J. Exp. Algorithmics 16 (2011). https://doi.org/10.1145/1963190.2025378

work page doi:10.1145/1963190.2025378 2011

[8] [8]

Djamal Belazzougui and Rossano Venturini. 2013. Compressed static functions with applications. InProc. 24th Annual ACM-SIAM Symposium on Discrete Algo- rithms (SODA). SIAM, 229–240. https://doi.org/10.1137/1.9781611973105.17

work page doi:10.1137/1.9781611973105.17 2013

[9] [9]

Bender, Alex Conway, Martin Farach-Colton, William Kuszmaul, and Guido Tagliavini

Michael A. Bender, Alex Conway, Martin Farach-Colton, William Kuszmaul, and Guido Tagliavini. 2023. Iceberg Hashing: Optimizing Many Hash-Table Criteria at Once.J. ACM70, 6 (2023), 40:1–40:51. https://doi.org/10.1145/3625817

work page doi:10.1145/3625817 2023

[10] [10]

Bercea, Jakob Bæk Tejs Houen, and Rasmus Pagh

Ioana O. Bercea, Jakob Bæk Tejs Houen, and Rasmus Pagh. 2024. Daisy Bloom Filters. InProc. 19th Scandinavian Symposium and Workshops on Algorithm Theory (SW AT). 9:1–9:19. https://doi.org/10.4230/LIPICS.SWAT.2024.9

work page doi:10.4230/lipics.swat.2024.9 2024

[11] [11]

Jock Blackard. 1998. Covertype. UCI Machine Learning Repository. https: //doi.org/10.24432/C50K5N

work page doi:10.24432/c50k5n 1998

[12] [12]

Burton H. Bloom. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors.Commun. ACM13, 7 (1970), 422–426. https://doi.org/10.1145/362686. 362692

work page doi:10.1145/362686 1970

[13] [13]

Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. 2022. A Learned Approach to Design Compressed Rank/Select Data Structures.ACM Trans. on Algorithms18, 3, Article 24 (oct 2022), 28 pages. https://doi.org/10.1145/3524060

work page doi:10.1145/3524060 2022

[14] [14]

Botelho, Yoshiharu Kohayakawa, and Nivio Ziviani

Fabiano C. Botelho, Yoshiharu Kohayakawa, and Nivio Ziviani. 2005. A Prac- tical Minimal Perfect Hashing Method. InProc. 4th International Workshop on Experimental and Efficient Algorithms (WEA). 488–500. https://doi.org/10.1007/ 11427186_42

work page 2005

[15] [15]

Botelho, Rasmus Pagh, and Nivio Ziviani

Fabiano C. Botelho, Rasmus Pagh, and Nivio Ziviani. 2013. Practical Perfect Hashing in Nearly Optimal Space.Inf. Syst.38, 1 (2013), 108–131. https://doi. org/10.1016/j.is.2012.06.002

work page doi:10.1016/j.is.2012.06.002 2013

[16] [16]

Jehoshua Bruck, Jie Gao, and Anxiao Jiang. 2006. Weighted Bloom filter. In Proc. 39th IEEE International Symposium on Information Theory (ISIT). 2304–2308. https://doi.org/10.1109/ISIT.2006.261978

work page doi:10.1109/isit.2006.261978 2006

[17] [17]

Hayoung Byun and Hyesook Lim. 2022. Learned FBF: Learning-Based Functional Bloom Filter for Key-Value Storage.IEEE Trans. Computers71, 8 (2022), 1928–1938. https://doi.org/10.1109/TC.2021.3112079

work page doi:10.1109/tc.2021.3112079 2022

[18] [18]

Benjamin Coleman, David Torres Ramos, Vihan Lakshman, Chen Luo, and An- shumali Shrivastava. 2023. CARAMEL: A Succinct Read-Only Lookup Table via Compressed Static Functions. (2023). https://doi.org/10.48550/arXiv.2305.16545

work page doi:10.48550/arxiv.2305.16545 2023

[19] [19]

Colin Cooper. 2000. On the rank of random matrices.Random Struct. Algorithms 16, 2 (2000), 209–232. https://doi.org/10.1002/(SICI)1098-2418(200003)16:2% 3C209::AID-RSA6%3E3.0.CO;2-1

work page doi:10.1002/(sici)1098-2418(200003)16:2 2000

[20] [20]

Zhenwei Dai and Anshumali Shrivastava. 2020. Adaptive Learned Bloom Filter (Ada-BF): Efficient Utilization of the Classifier with Application to Real-Time Information Filtering on the Web. InProc. 33rd Annual Conference on Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/ 2020/hash/86b94dae7c6517ec1ac767fd2c136580-Abst...

work page 2020

[21] [21]

Zhenwei Dai, Anshumali Shrivastava, Pedro Reviriego, and José Alberto Hernán- dez. 2022. Optimizing Learned Bloom Filters: How Much Should Be Learned? IEEE Embed. Syst. Lett.14, 3 (2022), 123–126. https://doi.org/10.1109/LES.2022. 3156019

work page doi:10.1109/les.2022 2022

[22] [22]

Scott Davies and Andrew W. Moore. 1999. Bayesian Networks for Lossless Dataset Compression. InProc. 5th ACM International Conference on Knowledge Discovery and Data Mining, (KDD). 387–391. https://doi.org/10.1145/312129. 312289

work page doi:10.1145/312129 1999

[23] [23]

1981.Assessing Probability Assessors: Calibration and Refinement.Technical Report

Morris H DeGroot and Stephen E Fienberg. 1981.Assessing Probability Assessors: Calibration and Refinement.Technical Report

work page 1981

[24] [24]

Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. 2024. Language Modeling Is Compression. InProc. 12th International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=jznbgiynus

work page 2024

[25] [25]

Martin Dietzfelbinger and Rasmus Pagh. 2008. Succinct Data Structures for Retrieval and Approximate Membership (Extended Abstract). InProc. 35th Inter- national Colloquium on Automata, Languages, and Programming (ICALP). 385–396. https://doi.org/10.1007/978-3-540-70575-8_32

work page doi:10.1007/978-3-540-70575-8_32 2008

[26] [26]

Dillinger, Lorenz Hübschle-Schneider, Peter Sanders, and Stefan Walzer

Peter C. Dillinger, Lorenz Hübschle-Schneider, Peter Sanders, and Stefan Walzer

work page

[27] [27]

Fast Succinct Retrieval and Approximate Membership Using Ribbon. In Proc. 20th International Symposium on Experimental Algorithms (SEA). 4:1–4:20. https://doi.org/10.4230/LIPICS.SEA.2022.4

work page doi:10.4230/lipics.sea.2022.4 2022

[28] [28]

Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, and Giorgio Vinciguerra

work page

[29] [29]

CPM.2016.13,doi:10.4230/LIPICS.CPM.2016.13

Learned monotone minimal perfect hashing. InProceedings of the 31st Annual European Symposium on Algorithms (ESA). https://doi.org/10.4230/LIPIcs. ESA.2023.46

work page doi:10.4230/lipics 2023

[30] [30]

Paolo Ferragina, Giovanni Manzini, and Giorgio Vinciguerra. 2022. Compressing and Querying Integer Dictionaries Under Linearities and Repetitions.IEEE Access 10 (2022), 118831–118848. https://doi.org/10.1109/ACCESS.2022.3221520

work page doi:10.1109/access.2022.3221520 2022

[31] [31]

Paolo Ferragina and Giorgio Vinciguerra. 2020. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds.PVLDB13, 8 (2020), 1162–1175. https://doi.org/10.14778/3389133.3389135

work page doi:10.14778/3389133.3389135 2020

[32] [32]

Fredman, János Komlós, and Endre Szemerédi

Michael L. Fredman, János Komlós, and Endre Szemerédi. 1984. Storing a Sparse Table with 𝑂( 1) Worst Case Access Time.J. ACM31, 3 (1984), 538–544. https: //doi.org/10.1145/828.1884

work page doi:10.1145/828.1884 1984

[33] [33]

Gallager

Robert G. Gallager. 1978. Variations on a theme by Huffman.IEEE Trans. Inf. Theory24, 6 (1978), 668–674. https://doi.org/10.1109/TIT.1978.1055959

work page doi:10.1109/tit.1978.1055959 1978

[34] [34]

Parameswaran

Yihan Gao and Aditya G. Parameswaran. 2016. Squish: Near-Optimal Com- pression for Archival of Relational Datasets. InProc. 22nd ACM International Conference on Knowledge Discovery and Data Mining (KDD). 1575–1584. https: //doi.org/10.1145/2939672.2939867

work page doi:10.1145/2939672.2939867 2016

[35] [35]

Marco Genuzio, Giuseppe Ottaviano, and Sebastiano Vigna. 2020. Fast scalable construction of ([compressed] static | minimal perfect hash) functions.Inf. Comput.273 (2020), 104517. https://doi.org/10.1016/J.IC.2020.104517

work page doi:10.1016/j.ic.2020.104517 2020

[36] [36]

Shelton, Andrew S

Dave Gomboc, Christian R. Shelton, Andrew S. Miner, and Gianfranco Ciardo

work page

[37] [37]

Comparing Lossless Compression Methods for Chess Endgame Data. In Proc. 27th European Conference on Artificial Intelligence (ECAI), Vol. 392. IOS Press, 4116–4123. https://doi.org/10.3233/FAIA240982

work page doi:10.3233/faia240982

[38] [38]

Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, and Idoia Ochoa. 2019. DeepZip: Lossless Data Compression Using Recurrent Neural Networks. InProc. 29th Data Compression Conference (DCC). 575. https://doi.org/10.1109/DCC.2019. 00087

work page doi:10.1109/dcc.2019 2019

[39] [39]

Thomas Mueller Graf and Daniel Lemire. 2020. Xor Filters.ACM J. Exp. Algo- rithmics25 (2020), 1–16. https://doi.org/10.1145/3376122

work page doi:10.1145/3376122 2020

[40] [40]

Thomas Mueller Graf and Daniel Lemire. 2022. Binary Fuse Filters: Fast and Smaller Than Xor Filters.ACM J. Exp. Algorithmics27 (2022), 1.5:1–1.5:15. https://doi.org/10.1145/3510449

work page doi:10.1145/3510449 2022

[41] [41]

Andrea Guerra, Giorgio Vinciguerra, Antonio Boffa, and Paolo Ferragina. 2025. Learned Compression of Nonlinear Time Series With Random Access. InProc. 41st IEEE International Conference on Data Engineering (ICDE). 1579–1592. https: //doi.org/10.1109/ICDE65448.2025.00122

work page doi:10.1109/icde65448.2025.00122 2025

[42] [42]

Torben Hagerup and Torsten Tholey. 2001. Efficient Minimal Perfect Hashing in Nearly Minimal Space. InProc. 18th Annual Symposium on Theoretical As- pects of Computer Science (STACS), Vol. 2010. 317–326. https://doi.org/10.1007/ 3-540-44693-1_28

work page 2001

[43] [43]

Stefan Hermann. 2025. MorphisHash: Improving Space Efficiency of ShockHash for Minimal Perfect Hashing.arXiv preprint arXiv:2503.10161(2025)

work page arXiv 2025

[44] [44]

Hreinsson, Morten Krøyer, and Rasmus Pagh

Jóhannes B. Hreinsson, Morten Krøyer, and Rasmus Pagh. 2009. Storing a Compressed Function with Constant Time Access. InProc. 17th Annual Eu- ropean Symposium on Algorithms (ESA). 730–741. https://doi.org/10.1007/ 978-3-642-04128-0_65

work page 2009

[45] [45]

David A Huffman. 2007. A method for the construction of minimum-redundancy codes.Proceedings of the IRE40, 9 (2007), 1098–1101

work page 2007

[46] [46]

Amir Ilkhechi, Andrew Crotty, Alex Galakatos, Yicong Mao, Grace Fan, Xiran Shi, and Ugur Cetintemel. 2020. DeepSqueeze: Deep Semantic Compression for Tabular Data. InProc. 46th ACM International Conference on Management of Data (SIGMOD). 1733–1746. https://doi.org/10.1145/3318464.3389734

work page doi:10.1145/3318464.3389734 2020

[47] [47]

Amitansh Joshi, Amit Parolkar, and Vedant Das. 2023. Spotify_1Million_Tracks. https://doi.org/10.34740/KAGGLE/DSV/5987852

work page doi:10.34740/kaggle/dsv/5987852 2023

[48] [48]

Minoru Kanehisa, Miho Furumichi, Yoko Sato, Mari Ishiguro-Watanabe, and Mao Tanabe. 2021. KEGG: integrating viruses and cellular organisms.Nucleic acids research49, D1 (2021), D545–D551

work page 2021

[49] [49]

Kiss, Éva Hosszu, János Tapolcai, Lajos Rónyai, and Ori Rottenstreich

Sándor Z. Kiss, Éva Hosszu, János Tapolcai, Lajos Rónyai, and Ori Rottenstreich

work page

[50] [50]

Bloom Filter With a False Positive Free Zone.IEEE Trans. Netw. Serv. Manag.18, 2 (2021), 2334–2349. https://doi.org/10.1109/TNSM.2021.3059075

work page doi:10.1109/tnsm.2021.3059075 2021

[51] [51]

1998.The art of computer programming: Volume 3: Sorting and Searching

Donald E Knuth. 1998.The art of computer programming: Volume 3: Sorting and Searching. Addison-Wesley Professional

work page 1998

[52] [52]

Puglisi, and Rajeev Raman

Dominik Köppl, Simon J. Puglisi, and Rajeev Raman. 2022. Fast and Simple Compact Hashing via Bucketing.Algorithmica84, 9 (2022), 2735–2766. https: //doi.org/10.1007/S00453-022-00996-Y

work page doi:10.1007/s00453-022-00996-y 2022

[53] [53]

Chi, Jeffrey Dean, and Neoklis Polyzotis

Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. InProc. 44th ACM International Conference on Management of Data (SIGMOD). 489–504. https://doi.org/10.1145/3183713. 3196909

work page doi:10.1145/3183713 2018

[54] [54]

Oguzhan Külekci

M. Oguzhan Külekci. 2014. Enhanced Variable-Length Codes: Improved Com- pression with Efficient Random Access. InProc. 24th Data Compression Conference (DCC). 362–371. https://doi.org/10.1109/DCC.2014.74

work page doi:10.1109/dcc.2014.74 2014

[55] [55]

Florian Kurpicz, Hans-Peter Lehmann, and Peter Sanders. 2023. PaCHash: Packed and compressed hash tables. InProc. 25th Symposium on Algorithm Engineering 13 and Experiments (ALENEX). 162–175. https://doi.org/10.1137/1.9781611977561. ch14

work page doi:10.1137/1.9781611977561 2023

[56] [56]

Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. 2023. SicHash - Small Irregular Cuckoo Tables for Perfect Hashing. InProc. 25th Symposium on Algo- rithm Engineering and Experiments (ALENEX). 176–189. https://doi.org/10.1137/ 1.9781611977561.ch15

work page 2023

[57] [57]

Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. 2024. ShockHash: To- wards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force. InProc. 26th Symposium on Algorithm Engineering and Experiments (ALENEX). 194–206. https://doi.org/10.1137/1.9781611977929.15

work page doi:10.1137/1.9781611977929.15 2024

[58] [58]

Hans-Peter Lehmann, Thomas Mueller, Rasmus Pagh, Giulio Ermanno Pibiri, Peter Sanders, Sebastiano Vigna, and Stefan Walzer. 2025. Modern Minimal Perfect Hashing: A Survey.CoRRabs/2506.06536 (2025). https://doi.org/10. 48550/ARXIV.2506.06536

work page arXiv 2025

[59] [59]

Hanwen Liu, Mihail Stoian, Alexander van Renen, and Andreas Kipf. 2024. Corra: Correlation-Aware Column Compression. InProc. 2nd Workshop on Cloud Databases (CloudDB). https://vldb.org/workshops/2024/proceedings/CloudDB/ clouddb-2.pdf

work page 2024

[60] [60]

Yihao Liu, Xinyu Zeng, and Huanchen Zhang. 2024. LeCo: Lightweight Com- pression via Learning Serial Correlations.Proc. ACM Manag. Data2, 1, Article 65 (mar 2024), 28 pages. https://doi.org/10.1145/3639320

work page doi:10.1145/3639320 2024

[61] [61]

Dario Malchiodi, Davide Raimondi, Giacomo Fumagalli, Raffaele Giancarlo, and Marco Frasca. 2024. The role of classifiers and data complexity in learned Bloom filters: insights and recommendations.J. Big Data11, 1 (2024), 45. https: //doi.org/10.1186/S40537-024-00906-9

work page doi:10.1186/s40537-024-00906-9 2024

[62] [62]

Michael Mitzenmacher. 2018. A Model for Learned Bloom Filters and Optimizing by Sandwiching. InProc. 31st Annual Conference on Neural Information Processing Systems (NeurIPS). 462–471. https://proceedings.neurips.cc/paper/2018/hash/ 0f49c89d1e7298bb9930789c8ed59d48-Abstract.html

work page 2018

[63] [63]

Christian Worm Mortensen, Rasmus Pagh, and Mihai Pˇatraçcu. 2005. On dynamic range reporting in one dimension. InProc. 37th annual ACM Symposium on Theory of Computing (STOC). 104–111

work page 2005

[64] [64]

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. 2015. Obtain- ing well calibrated probabilities using bayesian binning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 29

work page 2015

[65] [65]

2016.Compact data structures: a practical approach

Gonzalo Navarro. 2016.Compact data structures: a practical approach. Cambridge University Press

work page 2016

[66] [66]

Rasmus Pagh and Flemming Friche Rodler. 2004. Cuckoo hashing.J. Algorithms 51, 2 (2004), 122–144. https://doi.org/10.1016/j.jalgor.2003.12.002

work page doi:10.1016/j.jalgor.2003.12.002 2004

[67] [67]

Ripon Patgiri, Anupam Biswas, and Sabuzima Nayak. 2023. deepBF: Malicious URL detection using learned Bloom Filter and evolutionary deep learning.Com- put. Commun.200 (2023), 30–41. https://doi.org/10.1016/J.COMCOM.2022.12.027

work page doi:10.1016/j.comcom.2022.12.027 2023

[68] [68]

Mihai Pătraşcu. 2008. Succincter. InProc. 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS). 305–313. https://doi.org/10.1109/FOCS. 2008.83

work page doi:10.1109/focs 2008

[69] [69]

Ely Porat. 2009. An Optimal Bloom Filter Replacement Based on Matrix Solving. InProc. 4th International Computer Science Symposium in Russia (CSR). 263–273. https://doi.org/10.1007/978-3-642-03351-3_25

work page doi:10.1007/978-3-642-03351-3_25 2009

[70] [70]

Arvind Prasad and Shalini Chandra. 2024. PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning.Comput. Secur.136 (2024), 103545. https://doi.org/10.1016/ J.COSE.2023.103545 Dataset available at https://archive.ics.uci.edu/dataset/967/ phiusiil+phishing+url+dataset

work page arXiv 2024

[71] [71]

Rae, Sergey Bartunov, and Timothy P

Jack W. Rae, Sergey Bartunov, and Timothy P. Lillicrap. 2019. Meta-Learning Neural Bloom Filters. InProc. 36th International Conference on Machine Learning (ICML). 5271–5280. http://proceedings.mlr.press/v97/rae19a.html

work page 2019

[72] [72]

Julian Reichinger, Thomas Krismayer, and Jan Rellermeyer. 2024. COPR–Efficient, large-scale log storage and retrieval.arXiv preprint arXiv:2402.18355(2024)

work page arXiv 2024

[73] [73]

Pedro Reviriego, José Alberto Hernández, Zhenwei Dai, and Anshumali Shrivas- tava. 2021. Learned Bloom Filters in Adversarial Environments: A Malicious URL Detection Use-Case. InProc. 22nd IEEE International Conference on High Perfor- mance Switching and Routing (HPSR). 1–6. https://doi.org/10.1109/HPSR52026. 2021.9481857

work page doi:10.1109/hpsr52026 2021

[74] [74]

Ibrahim Sabek, Kapil Vaidya, Dominik Horn, Andreas Kipf, Michael Mitzen- macher, and Tim Kraska. 2022. Can Learned Models Replace Hash Functions? PVLDB16, 3 (2022), 532–545. https://doi.org/10.14778/3570690.3570702

work page doi:10.14778/3570690.3570702 2022

[75] [75]

Mohanad Sarhan, Siamak Layeghy, Nour Moustafa, and Marius Portmann. 2020. NetFlow Datasets for Machine Learning-Based Network Intrusion Detection Sys- tems. InProc. 10th EAI International Conference on Big Data Technologies and Appli- cations (LNICST), Vol. 371. 117–135. https://doi.org/10.1007/978-3-030-72802-1_9 Dataset available at https://staff.itee.u...

work page doi:10.1007/978-3-030-72802-1_9 2020

[76] [76]

Atsuki Sato and Yusuke Matsui. 2023. Fast Partitioned Learned Bloom Filter. InProc. 36th Annual Conference on Neural Information Process- ing Systems (NeurIPS). http://papers.nips.cc/paper_files/paper/2023/hash/ 7b2e844c52349134268e819a9b56b9e8-Abstract-Conference.html

work page 2023

[77] [77]

Jürgen Schmidhuber and Stefan Heil. 1996. Sequential neural text compression. IEEE Trans. Neural Networks7, 1 (1996), 142–146. https://doi.org/10.1109/72. 478398

work page doi:10.1109/72 1996

[78] [78]

Claude E Shannon. 1948. A mathematical theory of communication.The Bell system technical journal27, 3 (1948), 379–423

work page 1948

[79] [79]

Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. 2022. Space- efficient representation of genomic k-mer count tables.Algorithms Mol. Biol.17, 1 (2022), 5. https://doi.org/10.1186/S13015-022-00212-0

work page doi:10.1186/s13015-022-00212-0 2022

[80] [80]

Samuel D. Stearns. 1995. Arithmetic coding in lossless waveform compression. IEEE Trans. Signal Process.43, 8 (1995), 1874–1879. https://doi.org/10.1109/78. 403346

work page doi:10.1109/78 1995