Space Upper Bounds for $\alpha$-Perfect Hashing

Emre Telatar; Ryan Song

arxiv: 2603.15251 · v2 · submitted 2026-03-16 · 💻 cs.IT · math.IT

Space Upper Bounds for α-Perfect Hashing

Ryan Song , Emre Telatar This is my paper

Pith reviewed 2026-05-15 10:14 UTC · model grok-4.3

classification 💻 cs.IT math.IT

keywords α-perfect hashingminimal perfect hashingspace upper boundssampling schemehash functionsexpected collisionsdescription length

0 comments

The pith

A sampling-based scheme constructs α-perfect hash functions with strictly less space than baseline randomization for every α in [0,1].

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies approximate perfect hashing where a randomized scheme is α-perfect if it produces a hash function with at most (1-α)k expected collisions on any fixed input set of size k. A simple baseline switches randomly between a perfect hash function and a zero-bit random hash function. The authors then introduce a sampling construction that meets the same collision bound while using fewer bits to describe the output hash function. This reduction holds for all α, including the endpoints where the problem reduces to perfect hashing or fully random hashing. Lower space per key makes approximate perfect hashing more practical for storing and retrieving large key sets.

Core claim

We present a sampling-based hashing scheme for α-perfect hash functions that significantly improves upon the space requirement of the baseline strategy for all values of α.

What carries the argument

A sampling procedure that selects a hash function whose expected number of collisions on any fixed k-set is at most (1-α)k while using strictly shorter description length than the baseline.

If this is right

Space per key is reduced relative to the baseline for every intermediate α between 0 and 1.
The same construction recovers the known k log e bit bound when α equals 1.
Space usage approaches zero bits per key as α approaches 0.
The scheme applies uniformly to any universe size n and set size k.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The sampling technique could be combined with existing perfect-hashing data structures to produce practical approximate dictionaries.
Similar sampling arguments might yield improved space bounds for other approximate hash families such as locality-sensitive hashing.
One could measure actual description lengths on concrete n and k to quantify the savings for specific α values.

Load-bearing premise

The sampling procedure produces a hash function whose expected collision count on any fixed set A of size k is at most (1-α)k while using strictly less description length than the baseline for every α.

What would settle it

An explicit calculation or lower-bound proof showing that for some α there is no hash function meeting the expected-collision bound whose description length is shorter than the baseline length.

Figures

Figures reproduced from arXiv: 2603.15251 by Emre Telatar, Ryan Song.

**Figure 1.** Figure 1: Upper bounds on the optimal amortized representation length R∗(α). A. Main Results This paper proposes and analyzes the space requirements of two different α-perfect hashing schemes. The first scheme randomizes between minimal perfect hashing and zero-bit hashing to achieve a space requirement of λ(α) log(e) bits per key in expectation, where λ(α) = max α − 1/e 1 − 1/e , 0 . (1) This essentially achiev… view at source ↗

read the original abstract

In the problem of minimal perfect hashing, we are given a size $k$ subset $\mathcal{A}$ of a universe of keys $[n] = \{1,2, \cdots, n\}$, for which we wish to construct a hash function $h: [n] \to [k]$ such that $h(\cdot)$ maps $\mathcal{A}$ to $[k]$ with no collisions, i.e., the restriction of $h(\cdot)$ to $\mathcal{A}$ is injective. In this paper, we extend the study of minimal perfect hashing to the approximate setting. For an $\alpha \in [0, 1]$, we say that a randomized hashing scheme is $\alpha$-perfect if for any input $\mathcal{A}$ of size $k$, it outputs a hash function which exhibits at most $(1-\alpha)k$ collisions on $\mathcal{A}$ in expectation. One important performance consideration for any hashing scheme is the space required to store the hash functions. For minimal perfect hashing, it is well known that approximately $k\log(e)$ bits, or $\log(e)$ bits per key, is required to store the hash function. In this paper, we propose schemes for constructing minimal $\alpha$-perfect hash functions and analyze their space requirements. We begin by presenting a simple base-line scheme which randomizes between perfect hashing and zero-bit random hashing. We then present a more sophisticated hashing scheme based on sampling which significantly improves upon the space requirement of the aforementioned strategy for all values of $\alpha$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The sampling scheme for α-perfect hashing beats the baseline space bound for all α.

read the letter

The paper gives a sampling-based scheme for α-perfect hashing that uses less space than the baseline randomization for every value of α. They define α-perfect as a scheme where any k-set gets at most (1-α)k collisions in expectation. The baseline mixes a full perfect hash with a zero-description random hash to hit the bound at expected space α k log e bits. The sampling method improves this space figure while keeping the expectation. This is new work in the hashing literature, and the high-level construction matches the requirements without circularity or obvious flaws in the tradeoff. The space savings are the main contribution. It is relevant for anyone working on space-efficient data structures that can tolerate some collisions. The math appears grounded in standard probabilistic arguments for hashing. I would bring this to the next reading group to go over the sampling details and the exact space expressions. It is worth citing for the improved bound. The authors engage honestly with the prior perfect hashing results, so the paper deserves a serious referee.

Referee Report

0 major / 2 minor

Summary. The paper defines α-perfect hashing as a randomized scheme that, for any k-set A, outputs a hash function h:[n]→[k] with at most (1-α)k expected collisions on A. It gives a baseline construction that randomizes between a minimal perfect hash (≈k log e bits) and a zero-bit random hash to achieve expected space α k log e, then presents a sampling-based scheme claimed to use strictly less description length while preserving the same expected-collision bound for every α∈[0,1].

Significance. The sampling construction supplies an explicit, parameter-free improvement to the space upper bound for α-perfect hashing across the full range of α. If the collision-probability analysis is correct, the result strengthens the known space-collision tradeoff for approximate perfect hashing and supplies concrete, reproducible constructions that could be useful in data-structure design.

minor comments (2)

[Abstract] Abstract: the claim that the sampling scheme 'significantly improves' space is stated without a quantitative expression or ratio relative to the baseline α k log e; adding a brief formula (e.g., the achieved space as a function of α) would let readers immediately gauge the gain.
[§3 (Sampling Scheme)] The sampling procedure is described at a high level; the exact distribution from which the hash functions are sampled (and the precise calculation showing E[collisions] ≤ (1-α)k) should be written out with the same rigor used for the baseline, preferably in a dedicated subsection with an equation for the description length.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation for minor revision. The referee's summary correctly describes our baseline randomization construction and the sampling-based scheme that improves the space upper bound for α-perfect hashing across all α ∈ [0,1].

Circularity Check

0 steps flagged

No circularity: explicit constructions and space analysis are self-contained

full rationale

The paper defines α-perfect hashing directly from the collision expectation bound, then gives an explicit baseline (randomization between perfect hashing and zero-bit hashing) whose space is α k log e by direct calculation, followed by a sampling construction whose space improvement is derived from the sampling procedure's collision guarantee. No equation reduces to a fitted parameter renamed as prediction, no self-citation is load-bearing, and no ansatz or uniqueness claim is imported from prior work by the same authors. The derivation chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract relies on standard probabilistic analysis of hashing collisions and expectation; no free parameters, invented entities, or non-standard axioms are visible.

axioms (1)

standard math Standard probabilistic method for bounding expected collisions under random hashing
Used to define the α-perfect property and to compare the baseline and sampling schemes.

pith-pipeline@v0.9.0 · 5567 in / 1152 out tokens · 41159 ms · 2026-05-15T10:14:36.104213+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

randomizes between minimal perfect hashing and zero-bit hashing... sampling-based... Poisson Function Representation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Modern minimal perfect hashing: A survey,

H.-P. Lehmann, T. Mueller, R. Pagh, G. E. Pibiri, P. Sanders, S. Vigna, and S. Walzer, “Modern minimal perfect hashing: A survey,” 2026. [Online]. Available: https://arxiv.org/abs/2506.06536

work page arXiv 2026
[2]

Development of the alohanet,

N. Abramson, “Development of the alohanet,”IEEE Trans. Inf. Theory, vol. 31, no. 2, pp. 119–123, 1985

work page 1985
[3]

Minimum feedback for collision-free scheduling in massive random access,

J. Kang and W. Yu, “Minimum feedback for collision-free scheduling in massive random access,”IEEE Trans. Inf. Theory, vol. 67, no. 12, pp. 8094–8108, 2021

work page 2021
[4]

Scheduling versus contention for massive random access in massive mimo systems,

——, “Scheduling versus contention for massive random access in massive mimo systems,”IEEE Trans. Commun., vol. 70, no. 9, pp. 5811– 5824, 2022

work page 2022
[5]

Coded downlink massive random access and a finite de finetti theorem,

R. Song, K. M. Attiah, and W. Yu, “Coded downlink massive random access and a finite de finetti theorem,”IEEE Trans. Inf. Theory, vol. 71, no. 9, pp. 6932–6949, 2025

work page 2025
[6]

Downlink massive random access with lossy source coding,

R. Song and W. Yu, “Downlink massive random access with lossy source coding,” in2025 IEEE International Symposium on Information Theory (ISIT), 2025, pp. 1–6

work page 2025
[7]

On the size of separating systems and families of perfect hash functions,

M. L. Fredman and J. Komlós, “On the size of separating systems and families of perfect hash functions,”SIAM Journal on Algebraic Discrete Methods, vol. 5, no. 1, pp. 61–68, 1984

work page 1984
[8]

Fredman–Komlós bounds and information theory,

J. Körner, “Fredman–Komlós bounds and information theory,”SIAM Journal on Algebraic Discrete Methods, vol. 7, no. 4, pp. 560–570, 1986

work page 1986
[9]

New bounds for perfect hashing via infor- mation theory,

J. Korner and K. Marton, “New bounds for perfect hashing via infor- mation theory,”European Journal of Combinatorics, vol. 9, no. 6, pp. 523–530, 1988

work page 1988
[10]

Hash, displace, and compress,

D. Belazzougui, F. C. Botelho, and M. Dietzfelbinger, “Hash, displace, and compress,” inAlgorithms - ESA 2009, A. Fiat and P. Sanders, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 682–693

work page 2009
[11]

Shockhash: Towards optimal-space minimal perfect hashing beyond brute-force,

H.-P. Lehmann, P. Sanders, and S. Walzer, “Shockhash: Towards optimal-space minimal perfect hashing beyond brute-force,” in2024 Proceedings of the Symposium on Algorithm Engineering and Exper- iments (ALENEX), 2024, pp. 194–206

work page 2024
[12]

Strong functional representation lemma and applications to coding theorems,

C. T. Li and A. El Gamal, “Strong functional representation lemma and applications to coding theorems,”IEEE Trans. Inf. Theory, vol. 64, no. 11, pp. 6967–6978, 2018

work page 2018
[13]

The com- munication complexity of correlation,

P. Harsha, R. Jain, D. McAllester, and J. Radhakrishnan, “The com- munication complexity of correlation,” inTwenty-Second Annual IEEE Conference on Computational Complexity (CCC’07), 2007, pp. 10–23

work page 2007

[1] [1]

Modern minimal perfect hashing: A survey,

H.-P. Lehmann, T. Mueller, R. Pagh, G. E. Pibiri, P. Sanders, S. Vigna, and S. Walzer, “Modern minimal perfect hashing: A survey,” 2026. [Online]. Available: https://arxiv.org/abs/2506.06536

work page arXiv 2026

[2] [2]

Development of the alohanet,

N. Abramson, “Development of the alohanet,”IEEE Trans. Inf. Theory, vol. 31, no. 2, pp. 119–123, 1985

work page 1985

[3] [3]

Minimum feedback for collision-free scheduling in massive random access,

J. Kang and W. Yu, “Minimum feedback for collision-free scheduling in massive random access,”IEEE Trans. Inf. Theory, vol. 67, no. 12, pp. 8094–8108, 2021

work page 2021

[4] [4]

Scheduling versus contention for massive random access in massive mimo systems,

——, “Scheduling versus contention for massive random access in massive mimo systems,”IEEE Trans. Commun., vol. 70, no. 9, pp. 5811– 5824, 2022

work page 2022

[5] [5]

Coded downlink massive random access and a finite de finetti theorem,

R. Song, K. M. Attiah, and W. Yu, “Coded downlink massive random access and a finite de finetti theorem,”IEEE Trans. Inf. Theory, vol. 71, no. 9, pp. 6932–6949, 2025

work page 2025

[6] [6]

Downlink massive random access with lossy source coding,

R. Song and W. Yu, “Downlink massive random access with lossy source coding,” in2025 IEEE International Symposium on Information Theory (ISIT), 2025, pp. 1–6

work page 2025

[7] [7]

On the size of separating systems and families of perfect hash functions,

M. L. Fredman and J. Komlós, “On the size of separating systems and families of perfect hash functions,”SIAM Journal on Algebraic Discrete Methods, vol. 5, no. 1, pp. 61–68, 1984

work page 1984

[8] [8]

Fredman–Komlós bounds and information theory,

J. Körner, “Fredman–Komlós bounds and information theory,”SIAM Journal on Algebraic Discrete Methods, vol. 7, no. 4, pp. 560–570, 1986

work page 1986

[9] [9]

New bounds for perfect hashing via infor- mation theory,

J. Korner and K. Marton, “New bounds for perfect hashing via infor- mation theory,”European Journal of Combinatorics, vol. 9, no. 6, pp. 523–530, 1988

work page 1988

[10] [10]

Hash, displace, and compress,

D. Belazzougui, F. C. Botelho, and M. Dietzfelbinger, “Hash, displace, and compress,” inAlgorithms - ESA 2009, A. Fiat and P. Sanders, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 682–693

work page 2009

[11] [11]

Shockhash: Towards optimal-space minimal perfect hashing beyond brute-force,

H.-P. Lehmann, P. Sanders, and S. Walzer, “Shockhash: Towards optimal-space minimal perfect hashing beyond brute-force,” in2024 Proceedings of the Symposium on Algorithm Engineering and Exper- iments (ALENEX), 2024, pp. 194–206

work page 2024

[12] [12]

Strong functional representation lemma and applications to coding theorems,

C. T. Li and A. El Gamal, “Strong functional representation lemma and applications to coding theorems,”IEEE Trans. Inf. Theory, vol. 64, no. 11, pp. 6967–6978, 2018

work page 2018

[13] [13]

The com- munication complexity of correlation,

P. Harsha, R. Jain, D. McAllester, and J. Radhakrishnan, “The com- munication complexity of correlation,” inTwenty-Second Annual IEEE Conference on Computational Complexity (CCC’07), 2007, pp. 10–23

work page 2007