Bloom Filter Encoding for Machine Learning

Ionut Cardei; John Cartmell; Mihaela Cardei

arxiv: 2512.19991 · v2 · submitted 2025-12-23 · 💻 cs.LG

Bloom Filter Encoding for Machine Learning

John Cartmell , Mihaela Cardei , Ionut Cardei This is my paper

Pith reviewed 2026-05-16 20:40 UTC · model grok-4.3

classification 💻 cs.LG

keywords Bloom filtermachine learning preprocessingfeature encodingmemory reductiondata obfuscationclassification performancedimensionality reduction

0 comments

The pith

Bloom filter encodings let machine learning models reach accuracy close to raw data while using less memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a preprocessing step that converts each data sample into a fixed-length bit array through hash-based Bloom filter encoding. This produces a compact representation that lowers memory needs and hides original feature values. Tests across text, time-series, tabular, and image datasets with classifiers such as XGBoost, deep neural networks, convolutional networks, and logistic regression show performance stays comparable to training on raw inputs or standard dimensionality reductions. The approach is presented as a general-purpose method that keeps useful similarity structure intact for learning tasks.

Core claim

Bloom filter encoding maps each sample to a compact bit-array feature space via hash functions, yielding a fixed-length representation that supports classification accuracy comparable to raw data or conventional reductions while delivering consistent memory savings and optional obfuscation of original values.

What carries the argument

Bloom filter transform that hashes sample features into a fixed-length bit array.

Load-bearing premise

The hash-based encoding must retain enough similarity structure and discriminative information from the original features for standard classifiers to reach comparable accuracy without domain-specific adjustments.

What would settle it

A new dataset or classifier where Bloom-encoded inputs produce clearly lower accuracy than raw data while standard reductions maintain performance would show the encoding fails to preserve necessary structure in general.

Figures

Figures reproduced from arXiv: 2512.19991 by Ionut Cardei, John Cartmell, Mihaela Cardei.

**Figure 1.** Figure 1: Bloom filter construction/test process for item not a member of set. The false positive rate (FPR) depends on p1 and is denoted by [5]: FPR = (1 − e −kn/m) k = p k 1 (3) Bit occupancy, collision rate, and FPR are proportional: p1 ↑ ⇒ collision rate ↑ ⇒ FPR ↑ (4) The compression ratio is measured as [4]: Compression Ratio (CR) = n · S m (5) where S is element size in bits. For example, if S is a byte it is … view at source ↗

**Figure 2.** Figure 2: Training/Test data split, processing from n-feature samples to m-bit Bloom filter array, model training, inference and analysis of 200 heartbeat recordings, evenly split between normal and abnormal rhythms [9]. Tabular datasets include Adult 50K (48,842 instances, 14 features, income classification) [10] and CDC Diabetes (250,000 instances, 23 features) [11]. Image datasets comprise MNIST (70,000 handwritt… view at source ↗

**Figure 3.** Figure 3: Process for transforming samples into Bloom Bit Arrays for classifier training. output is reduced modulo the Bloom filter size m: hash_val = HMAC(K, f, v) mod m. (11) Algorithm 1: Bloom Filter Encoding of Samples Input: List of samples, Bloom filter size m, number of hash functions k, secret key K Output: List of Bloom filters BF_List, one per sample Initialize empty list BF_List ← [ ]; foreach sample s in… view at source ↗

**Figure 4.** Figure 4: Accuracy by Dataset, Classifier, and Transform (Color = Transform, Hatch = Dataset). and 4.17× vs. 0.29× on CDC) and higher accuracy across the Adult 50K and CDC Diabetes datasets, resulting in a more accurate classifier with a smaller memory footprint. Spam - Bloom EKG - Bloom Adult - Bloom Adult - PCA Adult - LDA CDC - Bloom CDC - PCA CDC - LDA MNIST - Bloom Fashion MNIST - Bloom Dataset - Transform 0 2 … view at source ↗

**Figure 5.** Figure 5: Compression Ratio by Dataset and Transform (Color = Transform, Hatch = Dataset). 5.3 Entropy and Bit Occupancy [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Entropy and Bit Occupancy per Dataset for Encoded Bloom Filter Transform (Color = Metric, Hatch = Dataset). per bit in the encoded Bloom filter representations. The bit occupancy values are 0.13 at the low-end up to 0.60. These results show that the encoded Bloom filters are not fully saturated and maintain a moderate fraction of bits set to 1. Datasets with higher bit occupancy, such as the EKG and MNIST,… view at source ↗

**Figure 7.** Figure 7: SMS Spam Dataset - Sweeping Bloom Filter size and number of hash functions. too small, multiple features collide into the same bit positions, which increases information loss and reduces classifier fidelity. The compression ratio is also shown with various encoded Bloom filter sizes producing different memory savings. For instance, reducing m from 200 to 50 increases the compression ratio from approximatel… view at source ↗

read the original abstract

We present a method that uses a Bloom filter transform to preprocess data for machine learning. Each sample is encoded into a compact bit-array representation using hash-based encoding, producing a fixed-length feature space that reduces memory usage and obfuscates original feature values. The encoding does not rely on keyed hashing; however, a key can optionally be used to control the mapping and would be required to reproduce the representation. We evaluate the approach on six datasets spanning text, time-series, tabular, and image domains: SMS Spam Collection, ECG200, Adult 50K, CDC Diabetes, MNIST, and Fashion MNIST. Four classifiers are considered: Extreme Gradient Boosting, Deep Neural Networks, Convolutional Neural Networks, and Logistic Regression. Results show that models trained on Bloom filter encodings achieve performance comparable to models trained on raw data or standard dimensionality reduction techniques across several datasets, while providing consistent memory savings. These findings suggest that Bloom filter encodings can serve as an efficient, general-purpose pre-processing representation that preserves useful similarity structure for learning tasks while providing a degree of data obfuscation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Bloom filter encoding delivers comparable accuracy with memory cuts on the tested sets, but missing parameter details and hashing mechanics for non-set data leave the generality unproven.

read the letter

Bloom filter encoding gives comparable accuracy on the tested sets with memory savings, but without parameter details or collision analysis it's hard to see how general this is. The paper takes the old Bloom filter trick and applies it as a fixed bit-vector preprocessor across text, time series, tabular, and image data. They run it through XGBoost, DNNs, CNNs, and logistic regression on six public datasets and report that performance holds up while memory use drops. That practical angle is the main thing worth noting if you need a lightweight way to compress features or add basic obfuscation without extra keys. The experiments stick to standard classifiers and public data, so the setup itself is straightforward to check. The soft spots sit in the evaluation. The abstract talks about comparable results but supplies no accuracy numbers, no error bars, and no values for Bloom filter length or hash count. For tabular and image inputs it is unclear how raw numbers or pixels become hashable tokens, and there is no check on collision rates or whether the bit vectors keep enough class separation. Bloom filters are built for membership queries, not distance preservation, so without those controls the results could depend on lucky parameter choices rather than a reliable property of the transform. This is for applied people who want a simple memory-saving step in pipelines where exact raw features are not required. It does not introduce new theory or large-scale wins, but the idea is concrete enough that a referee could verify the gaps in one pass. Send it to review once the parameter tables and exact results are added; the core claim is testable and the gaps are fixable.

Referee Report

3 major / 2 minor

Summary. The paper proposes a Bloom filter transform to encode ML samples into compact fixed-length bit arrays via hash-based encoding. This preprocessing step is intended to reduce memory usage and provide feature obfuscation while preserving enough similarity structure for standard classifiers to achieve performance comparable to raw data or conventional dimensionality reduction. The method is tested empirically on six datasets (SMS Spam, ECG200, Adult, CDC Diabetes, MNIST, Fashion MNIST) with four classifiers (XGBoost, DNN, CNN, logistic regression).

Significance. If the central empirical claim holds after proper controls, the technique could supply a lightweight, general-purpose preprocessing representation that delivers consistent memory savings together with a modest privacy benefit through one-way hashing. Its simplicity and domain-agnostic nature would make it attractive for resource-limited or privacy-sensitive deployments, provided the encoding demonstrably retains discriminative information without dataset-specific tuning.

major comments (3)

[Method] Method section: the encoding procedure for non-set data (continuous tabular features in Adult/Diabetes, pixel intensities in MNIST) is not specified, including tokenization into hashable elements and whether standard or locality-sensitive hash families are employed. This detail is load-bearing because Bloom filters supply no distance-preservation guarantee; without it, comparable accuracy cannot be attributed to the transform rather than favorable collision rates.
[Experiments] Experiments/Results: no values are reported for Bloom-filter length m or hash count k on any dataset, nor any analysis of resulting false-positive rates or collision statistics. The headline claim of 'comparable performance' therefore cannot be evaluated or reproduced, as these parameters directly control information loss.
[Results] Results section: the abstract asserts 'comparable performance' and 'consistent memory savings' yet supplies neither numerical accuracies, standard deviations, nor statistical tests against raw-data and dimensionality-reduction baselines. Without these metrics the central empirical claim remains unverified.

minor comments (2)

[Abstract] Abstract: the phrase 'a key can optionally be used' should clarify whether the key is required for experimental reproducibility and how it affects the reported results.
[Results] The manuscript should include a table or figure quantifying memory reduction (bits per sample) relative to the original feature dimensionality for each dataset.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us strengthen the clarity and reproducibility of the manuscript. We address each major point below and have made corresponding revisions.

read point-by-point responses

Referee: [Method] Method section: the encoding procedure for non-set data (continuous tabular features in Adult/Diabetes, pixel intensities in MNIST) is not specified, including tokenization into hashable elements and whether standard or locality-sensitive hash families are employed. This detail is load-bearing because Bloom filters supply no distance-preservation guarantee; without it, comparable accuracy cannot be attributed to the transform rather than favorable collision rates.

Authors: We agree that the original manuscript did not provide sufficient detail on the encoding of non-set data. In the revised version we have added an explicit subsection describing the procedure: continuous tabular features are discretized into equal-width bins and each bin is treated as a distinct token; pixel intensities in MNIST/Fashion-MNIST are hashed individually after optional 2x2 patch aggregation. We use standard non-cryptographic hash families (MurmurHash3) rather than locality-sensitive hashes. We acknowledge that Bloom filters provide no formal distance preservation and have added a short discussion clarifying that the observed performance is an empirical result, now supported by the newly reported collision statistics. revision: yes
Referee: [Experiments] Experiments/Results: no values are reported for Bloom-filter length m or hash count k on any dataset, nor any analysis of resulting false-positive rates or collision statistics. The headline claim of 'comparable performance' therefore cannot be evaluated or reproduced, as these parameters directly control information loss.

Authors: We accept this criticism. The revised manuscript now contains a dedicated table listing the exact m and k values chosen for each of the six datasets, the resulting theoretical false-positive rate, and empirical collision counts measured on the encoded training sets. These parameters were selected to keep the false-positive rate below 0.05 while respecting the memory budget; the new table makes the experimental configuration fully reproducible. revision: yes
Referee: [Results] Results section: the abstract asserts 'comparable performance' and 'consistent memory savings' yet supplies neither numerical accuracies, standard deviations, nor statistical tests against raw-data and dimensionality-reduction baselines. Without these metrics the central empirical claim remains unverified.

Authors: We agree that the original results section was insufficiently quantitative. We have expanded it with two new tables: one reporting mean accuracy and standard deviation (over 10 random seeds) for Bloom-filter, raw, and PCA-encoded versions of each dataset; the second showing memory footprint in bytes per sample. We also added paired t-test p-values comparing Bloom-filter performance against the raw-data baseline. These additions directly support the claims made in the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or fitted predictions

full rationale

The paper describes a Bloom filter encoding method and reports direct experimental results on six public datasets using four standard classifiers. No equations, derivations, uniqueness theorems, or parameter-fitting steps are present that could reduce a claimed prediction back to the input by construction. Performance comparability is measured via standard accuracy metrics on held-out test sets, making the evaluation self-contained and falsifiable against external benchmarks without reliance on self-citations or ansatzes.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that Bloom filter bit arrays retain enough statistical structure for downstream learning. Free parameters include filter length and number of hash functions, which must be chosen but are not quantified in the abstract. No new physical or mathematical entities are postulated.

free parameters (1)

Bloom filter length and hash function count
These control the encoding density and collision rate and are required to reproduce the bit-array representation, yet no specific values or selection method appear in the abstract.

axioms (1)

domain assumption Hash-based encoding into a fixed bit array preserves task-relevant similarity structure across text, time-series, tabular, and image data.
Invoked implicitly when claiming comparable performance without loss of discriminative power.

pith-pipeline@v0.9.0 · 5479 in / 1298 out tokens · 38487 ms · 2026-05-16T20:40:24.279437+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Results show that models trained on Bloom filter encodings achieve performance comparable to models trained on raw data...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Privacy-Preserving Distributed Learning in IoT Systems: A Unified Threat Model and Evaluation Framework
cs.CR 2026-05 unverdicted novelty 5.0

A unified threat model and evaluation framework is developed to compare privacy-preserving methods for distributed learning in IoT, showing trade-offs in privacy robustness and system efficiency with Bloom filter enco...

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

RobustBF: A High Accuracy and Memory Efficient 2D Bloom Filter,

S. Nayak and R. Patgiri, “RobustBF: A High Accuracy and Memory Efficient 2D Bloom Filter,”arXiv preprint, 2021, arXiv:2106.04365

work page arXiv 2021
[2]

A Privacy Model for Classical & Learned Bloom Filters,

H. Tirmazi, “A Privacy Model for Classical & Learned Bloom Filters,”arXiv preprint, 2025, arXiv:2501.15751

work page arXiv 2025
[3]

Space/time trade-offs in hash coding with allowable errors,

B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,”Commu- nications of the ACM, vol. 13, no. 7, pp. 422–426, 1970, doi: 10.1145/362686.362692

work page doi:10.1145/362686.362692 1970
[4]

Compressed Bloom Filters,

M. Mitzenmacher, “Compressed Bloom Filters,”IEEE/ACM Transactions on Net- working, vol. 10, no. 5, pp. 604–612, 2002, doi: 10.1109/TNET.2002.803823

work page doi:10.1109/tnet.2002.803823 2002
[5]

Network Applications of Bloom Filters: A Sur- vey,

A. Broder and M. Mitzenmacher, “Network Applications of Bloom Filters: A Sur- vey,”Internet Mathematics, vol. 1, no. 4, pp. 485–509, 2005, doi: 10.1080/15427951. 2004.10129096

work page doi:10.1080/15427951 2005
[6]

DGA Detection Using Similarity-Preserving Bloom Encod- ings,

L. Nitz and A. Mandal, “DGA Detection Using Similarity-Preserving Bloom Encod- ings,” inProc. European Interdisciplinary Cybersecurity Conference (EICC), Sta- vanger, Norway, June 14-15 2023, pp. 116-120, doi:10.1145/3590777.3590795

work page doi:10.1145/3590777.3590795 2023
[7]

A blinded evaluation of privacy pre- serving record linkage with Bloom filters,

S. Randall, H. Wichmann, A. Brownet al., “A blinded evaluation of privacy pre- serving record linkage with Bloom filters,”BMC Medical Research Methodology, vol. 22, no. 1, Art. 22, Jan. 2022, doi: 10.1186/s12874-022-01510-2. 14 J. Cartmell et al

work page doi:10.1186/s12874-022-01510-2 2022
[8]

SMS Spam Collection Dataset,

UCI Machine Learning Repository, “SMS Spam Collection Dataset,” Kaggle, 2011

work page 2011
[9]

Olszewski,Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data, Ph.D

R. Olszewski,Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data, Ph.D. dissertation, Carnegie Mellon University, 2001

work page 2001
[10]

UCI Machine Learning Repository: Adult Data Set,

D. Dua and C. Graff, “UCI Machine Learning Repository: Adult Data Set,” 1996

work page 1996
[11]

Diabetes Health Indicators Dataset,

Centers for Disease Control and Prevention (CDC), “Diabetes Health Indicators Dataset,” 2020

work page 2020
[12]

MNIST Handwritten Digit Database,

Y. LeCun and C. Cortes, “MNIST Handwritten Digit Database,” 1998

work page 1998
[13]

Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms,

X. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms,” 2017

work page 2017
[14]

Chen and C

T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” inPro- ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis- covery and Data Mining (KDD), San Francisco, CA, USA, 2016, pp. 785-794, doi: 10.1145/2939672.2939785

work page doi:10.1145/2939672.2939785 2016
[15]

W., et al., 2019, Science, DOI: 10.1126/sci- ence.aaw5903 Barsdell et al., 2010, MNRAS, 408,

G. E. Hinton and R. R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks,”Science, vol. 313, no. 5786, pp. 504–507, 2006, doi: 10.1126/sci- ence.1127647

work page doi:10.1126/sci- 2006
[16]

ImageNet Classification with Deep Convolutional Neural Networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” inAdvances in Neural Information Processing Sys- tems (NeurIPS), vol. 25, 2012

work page 2012
[17]

The Regression Analysis of Binary Sequences,

D. R. Cox, “The Regression Analysis of Binary Sequences,”Journal of the Royal Statistical Society: Series B (Methodological), vol. 20, no. 2, pp. 215–242, 1958

work page 1958
[18]

Invertible Bloom Lookup Tables with Less Memory and Randomness,

N. Fleischhacker, K. G. Larsen, M. Obremski, and M. Simkin, “Invertible Bloom Lookup Tables with Less Memory and Randomness,” inProc. 32nd Annu. Eur. Symp. Algorithms (ESA 2024), Leibniz Int. Proc. Informatics (LIPIcs), vol. 308, pp. 54:1–54:17, 2024, doi: 10.4230/LIPIcs.ESA.2024.54

work page doi:10.4230/lipics.esa.2024.54 2024
[19]

Sampling and Reconstruction Using Bloom Filters

N. Sengupta, A. Bagchi, S. Bedathur, and M. Ramanath, “Sampling and Recon- struction Using Bloom Filters,”arXiv preprint arXiv:1701.03308, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

HMAC: Keyed-hashing for mes- sage authentication,

H. Krawczyk, M. Bellare, and R. Canetti, “HMAC: Keyed-hashing for mes- sage authentication,”RFC 2104, Internet Eng. Task Force (IETF), 1997, doi: 10.17487/RFC2104

work page doi:10.17487/rfc2104 1997
[21]

Improved visualization of high-dimensional data using the distance-of-distances transformation,

J. Liu and M. Vinck, “Improved visualization of high-dimensional data using the distance-of-distances transformation,”Journal of Big Data, vol. 9, no. 1, Art. 72, 2022, doi: 10.1186/s40537-022-00525-5

work page doi:10.1186/s40537-022-00525-5 2022
[22]

The use of multiple measurements in taxonomic problems,

R. A. Fisher, “The use of multiple measurements in taxonomic problems,”Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936

work page 1936
[23]

I. T. Jolliffe,Principal Component Analysis, 2nd ed., Springer Series in Statistics. New York, NY, USA: Springer, 2002

work page 2002
[24]

Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE),

F. Anowar, S. Sadaoui, and B. Selim, “Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE),”Computer Science Review, vol. 40, p. 100378, 2021, doi: 10.1016/j.cosrev.2021.100378

work page doi:10.1016/j.cosrev.2021.100378 2021
[25]

Privacy- preserving record linkage using Bloom filters: A systematic literature review,

H. S. P. Vidanage, T. Ranbaduge, P. Christen, and D. Vatsalan, “Privacy- preserving record linkage using Bloom filters: A systematic literature review,” J. Inf. Secur. Appl., vol. 54, p. 102493, 2020, doi: 10.1016/j.jisa.2020.102493

work page doi:10.1016/j.jisa.2020.102493 2020
[26]

Keyed-Hash Message Authentication Code (HMAC): Specification and Recommendations (Initial Public Draft),

M. Sönmez Turan and L. T. A. N. Brandão, “Keyed-Hash Message Authentication Code (HMAC): Specification and Recommendations (Initial Public Draft),” NIST Special Publication 800-224 (IPD), Jun. 2024

work page 2024

[1] [1]

RobustBF: A High Accuracy and Memory Efficient 2D Bloom Filter,

S. Nayak and R. Patgiri, “RobustBF: A High Accuracy and Memory Efficient 2D Bloom Filter,”arXiv preprint, 2021, arXiv:2106.04365

work page arXiv 2021

[2] [2]

A Privacy Model for Classical & Learned Bloom Filters,

H. Tirmazi, “A Privacy Model for Classical & Learned Bloom Filters,”arXiv preprint, 2025, arXiv:2501.15751

work page arXiv 2025

[3] [3]

Space/time trade-offs in hash coding with allowable errors,

B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,”Commu- nications of the ACM, vol. 13, no. 7, pp. 422–426, 1970, doi: 10.1145/362686.362692

work page doi:10.1145/362686.362692 1970

[4] [4]

Compressed Bloom Filters,

M. Mitzenmacher, “Compressed Bloom Filters,”IEEE/ACM Transactions on Net- working, vol. 10, no. 5, pp. 604–612, 2002, doi: 10.1109/TNET.2002.803823

work page doi:10.1109/tnet.2002.803823 2002

[5] [5]

Network Applications of Bloom Filters: A Sur- vey,

A. Broder and M. Mitzenmacher, “Network Applications of Bloom Filters: A Sur- vey,”Internet Mathematics, vol. 1, no. 4, pp. 485–509, 2005, doi: 10.1080/15427951. 2004.10129096

work page doi:10.1080/15427951 2005

[6] [6]

DGA Detection Using Similarity-Preserving Bloom Encod- ings,

L. Nitz and A. Mandal, “DGA Detection Using Similarity-Preserving Bloom Encod- ings,” inProc. European Interdisciplinary Cybersecurity Conference (EICC), Sta- vanger, Norway, June 14-15 2023, pp. 116-120, doi:10.1145/3590777.3590795

work page doi:10.1145/3590777.3590795 2023

[7] [7]

A blinded evaluation of privacy pre- serving record linkage with Bloom filters,

S. Randall, H. Wichmann, A. Brownet al., “A blinded evaluation of privacy pre- serving record linkage with Bloom filters,”BMC Medical Research Methodology, vol. 22, no. 1, Art. 22, Jan. 2022, doi: 10.1186/s12874-022-01510-2. 14 J. Cartmell et al

work page doi:10.1186/s12874-022-01510-2 2022

[8] [8]

SMS Spam Collection Dataset,

UCI Machine Learning Repository, “SMS Spam Collection Dataset,” Kaggle, 2011

work page 2011

[9] [9]

Olszewski,Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data, Ph.D

R. Olszewski,Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data, Ph.D. dissertation, Carnegie Mellon University, 2001

work page 2001

[10] [10]

UCI Machine Learning Repository: Adult Data Set,

D. Dua and C. Graff, “UCI Machine Learning Repository: Adult Data Set,” 1996

work page 1996

[11] [11]

Diabetes Health Indicators Dataset,

Centers for Disease Control and Prevention (CDC), “Diabetes Health Indicators Dataset,” 2020

work page 2020

[12] [12]

MNIST Handwritten Digit Database,

Y. LeCun and C. Cortes, “MNIST Handwritten Digit Database,” 1998

work page 1998

[13] [13]

Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms,

X. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms,” 2017

work page 2017

[14] [14]

Chen and C

T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” inPro- ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis- covery and Data Mining (KDD), San Francisco, CA, USA, 2016, pp. 785-794, doi: 10.1145/2939672.2939785

work page doi:10.1145/2939672.2939785 2016

[15] [15]

W., et al., 2019, Science, DOI: 10.1126/sci- ence.aaw5903 Barsdell et al., 2010, MNRAS, 408,

G. E. Hinton and R. R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks,”Science, vol. 313, no. 5786, pp. 504–507, 2006, doi: 10.1126/sci- ence.1127647

work page doi:10.1126/sci- 2006

[16] [16]

ImageNet Classification with Deep Convolutional Neural Networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” inAdvances in Neural Information Processing Sys- tems (NeurIPS), vol. 25, 2012

work page 2012

[17] [17]

The Regression Analysis of Binary Sequences,

D. R. Cox, “The Regression Analysis of Binary Sequences,”Journal of the Royal Statistical Society: Series B (Methodological), vol. 20, no. 2, pp. 215–242, 1958

work page 1958

[18] [18]

Invertible Bloom Lookup Tables with Less Memory and Randomness,

N. Fleischhacker, K. G. Larsen, M. Obremski, and M. Simkin, “Invertible Bloom Lookup Tables with Less Memory and Randomness,” inProc. 32nd Annu. Eur. Symp. Algorithms (ESA 2024), Leibniz Int. Proc. Informatics (LIPIcs), vol. 308, pp. 54:1–54:17, 2024, doi: 10.4230/LIPIcs.ESA.2024.54

work page doi:10.4230/lipics.esa.2024.54 2024

[19] [19]

Sampling and Reconstruction Using Bloom Filters

N. Sengupta, A. Bagchi, S. Bedathur, and M. Ramanath, “Sampling and Recon- struction Using Bloom Filters,”arXiv preprint arXiv:1701.03308, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

HMAC: Keyed-hashing for mes- sage authentication,

H. Krawczyk, M. Bellare, and R. Canetti, “HMAC: Keyed-hashing for mes- sage authentication,”RFC 2104, Internet Eng. Task Force (IETF), 1997, doi: 10.17487/RFC2104

work page doi:10.17487/rfc2104 1997

[21] [21]

Improved visualization of high-dimensional data using the distance-of-distances transformation,

J. Liu and M. Vinck, “Improved visualization of high-dimensional data using the distance-of-distances transformation,”Journal of Big Data, vol. 9, no. 1, Art. 72, 2022, doi: 10.1186/s40537-022-00525-5

work page doi:10.1186/s40537-022-00525-5 2022

[22] [22]

The use of multiple measurements in taxonomic problems,

R. A. Fisher, “The use of multiple measurements in taxonomic problems,”Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936

work page 1936

[23] [23]

I. T. Jolliffe,Principal Component Analysis, 2nd ed., Springer Series in Statistics. New York, NY, USA: Springer, 2002

work page 2002

[24] [24]

Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE),

F. Anowar, S. Sadaoui, and B. Selim, “Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE),”Computer Science Review, vol. 40, p. 100378, 2021, doi: 10.1016/j.cosrev.2021.100378

work page doi:10.1016/j.cosrev.2021.100378 2021

[25] [25]

Privacy- preserving record linkage using Bloom filters: A systematic literature review,

H. S. P. Vidanage, T. Ranbaduge, P. Christen, and D. Vatsalan, “Privacy- preserving record linkage using Bloom filters: A systematic literature review,” J. Inf. Secur. Appl., vol. 54, p. 102493, 2020, doi: 10.1016/j.jisa.2020.102493

work page doi:10.1016/j.jisa.2020.102493 2020

[26] [26]

Keyed-Hash Message Authentication Code (HMAC): Specification and Recommendations (Initial Public Draft),

M. Sönmez Turan and L. T. A. N. Brandão, “Keyed-Hash Message Authentication Code (HMAC): Specification and Recommendations (Initial Public Draft),” NIST Special Publication 800-224 (IPD), Jun. 2024

work page 2024