Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing
Pith reviewed 2026-05-16 08:33 UTC · model grok-4.3
The pith
Even with perfect data and training, limited capacity forces LLMs to assign high confidence to some non-facts
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The information-theoretically optimal strategy for membership testing under limited capacity, when facts are sparse, is characterized by the minimum KL divergence between the score distributions on facts and non-facts; this value is positive, so the optimal policy assigns high to some non-facts, which manifests as hallucination.
What carries the argument
The rate-distortion function for membership testing, defined as the minimum KL divergence between score distributions on facts and non-facts under a capacity (rate) constraint.
If this is right
- Hallucinations remain even with perfect training data and optimal optimization in a closed world.
- Abstaining or responding with low is not always capacity-optimal compared with confident errors on non-facts.
- The same rate-distortion tradeoff unifies discrete set-membership errors with continuous probabilistic losses.
- Increasing capacity reduces but does not eliminate the forced error rate unless the fact density changes.
Where Pith is reading between the lines
- Retrieval-augmented systems could lower the effective sparsity and thereby reduce the predicted hallucination rate.
- The same tradeoff may appear in any learning system that compresses a sparse set of positive examples.
- Measuring the empirical KL divergence between fact and non-fact scores on real models could test the predicted lower bound.
Load-bearing premise
Facts are sparse relative to the space of all plausible claims, so that perfect separation is not the most efficient use of limited memory.
What would settle it
A concrete strategy or trained model that achieves lower average log-loss than the predicted minimum KL divergence at the same capacity, or that reaches zero distortion on non-facts without using more bits.
Figures
read the original abstract
Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination: even with optimal training, perfect data, and a simplified "closed world" setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on synthetic data, showing that hallucinations persist as a natural consequence of lossy compression.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that hallucinations in LLMs arise as an inevitable consequence of space-optimal memorization of sparse facts. It formalizes the problem as membership testing, unifies Bloom-filter false-positive rates with continuous log-loss via a rate-distortion objective, and proves that the capacity-constrained optimum is characterized by the minimum KL divergence between score distributions on facts versus non-facts. This optimum forces high-confidence assignment to some non-facts rather than abstention. The claim is supported by a synthetic-data validation showing persistent hallucinations under optimal compression in a closed-world sparse regime.
Significance. If the central theorem is correct, the work supplies a clean information-theoretic explanation for why hallucinations survive even perfect data and optimal training, shifting focus from training artifacts to fundamental capacity limits. The unification of discrete set-membership distortion with log-loss is technically interesting and could inform capacity-aware evaluation protocols.
major comments (3)
- [Section 3] The rate-distortion theorem (Section 3): the manuscript states that the optimal memory is characterized by min KL between fact and non-fact score distributions, yet supplies neither the derivation of the rate-distortion function nor the explicit optimization that shows why the minimizer must place high mass on non-facts. Without these steps it is impossible to confirm that the claimed behavior follows from the stated objective rather than from the modeling assumptions.
- [Section 2] Problem formulation (Section 2): the result that abstention is suboptimal holds only under the joint assumptions of discrete sparse claims, symmetric log-loss distortion, and absence of inferable structure. No sensitivity analysis or alternative distortion is examined; relaxing any of these (as noted in the skeptic analysis) allows low-confidence or abstention solutions to achieve the same rate, undermining the generality of the hallucination explanation.
- [Section 5] Empirical validation (Section 5): the synthetic experiments are reported to confirm the theory inside the assumed regime, but the text provides neither the precise experimental protocol (number of trials, exact capacity values, statistical controls) nor quantitative baselines that would demonstrate the hallucinations are produced by the rate-distortion optimum rather than by implementation details.
minor comments (2)
- [Section 2] Notation for the unified distortion measure is introduced without an explicit equation linking the Bloom-filter error probability to the log-loss term; adding this would clarify the unification.
- The abstract and introduction cite the closed-world assumption but do not reference prior rate-distortion analyses of memorization or Bloom-filter literature in learning theory.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate the requested clarifications, derivations, and experimental details while preserving the core claims.
read point-by-point responses
-
Referee: [Section 3] The rate-distortion theorem (Section 3): the manuscript states that the optimal memory is characterized by min KL between fact and non-fact score distributions, yet supplies neither the derivation of the rate-distortion function nor the explicit optimization that shows why the minimizer must place high mass on non-facts. Without these steps it is impossible to confirm that the claimed behavior follows from the stated objective rather than from the modeling assumptions.
Authors: We agree that the derivation steps require more explicit presentation. In the revised version we will add a dedicated subsection deriving the rate-distortion function from the mutual-information rate constraint and the expected symmetric log-loss distortion. The optimization will be shown to yield a minimizer that places positive mass on non-facts because any abstention strategy increases the required rate to keep distortion below the target threshold; the minimum KL characterization follows directly from the Lagrange dual of the constrained problem. revision: yes
-
Referee: [Section 2] Problem formulation (Section 2): the result that abstention is suboptimal holds only under the joint assumptions of discrete sparse claims, symmetric log-loss distortion, and absence of inferable structure. No sensitivity analysis or alternative distortion is examined; relaxing any of these (as noted in the skeptic analysis) allows low-confidence or abstention solutions to achieve the same rate, undermining the generality of the hallucination explanation.
Authors: The assumptions are stated explicitly in Section 2 as the modeling choices that isolate the sparse, unstructured regime relevant to random-fact hallucinations. We will revise the text to include a short sensitivity paragraph clarifying that the sub-optimality of abstention is specific to this regime and that structured or inferable facts fall outside the theorem’s scope. While a comprehensive sensitivity study lies beyond the present scope, we will note that alternative distortions permitting abstention correspond to different problem settings with higher effective capacity requirements. revision: partial
-
Referee: [Section 5] Empirical validation (Section 5): the synthetic experiments are reported to confirm the theory inside the assumed regime, but the text provides neither the precise experimental protocol (number of trials, exact capacity values, statistical controls) nor quantitative baselines that would demonstrate the hallucinations are produced by the rate-distortion optimum rather than by implementation details.
Authors: We will expand Section 5 with the missing protocol details: 100 independent trials, capacity values of 0.1–0.5 bits per fact, bootstrap confidence intervals, and two quantitative baselines (uniform random scoring and a simple threshold classifier). These additions will show that the observed hallucination rates match the theoretical minimum-KL predictions rather than arising from implementation artifacts. revision: yes
Circularity Check
No circularity: rate-distortion optimum follows directly from stated distortion and sparsity assumptions
full rationale
The derivation begins with an explicit rate-distortion formulation for membership testing that unifies Bloom-filter error with log-loss. The optimal strategy is characterized as the minimum KL divergence between score distributions on facts versus non-facts; under the paper's stated regime of sparse facts in a closed discrete universe, this minimum is achieved by assigning high confidence to some non-facts. This is a mathematical consequence of the chosen distortion measure and capacity constraint, not a redefinition or fit. No self-citations, fitted parameters, or ansatzes are invoked to close the argument, and the synthetic experiments simply instantiate the same objective. The result is therefore self-contained against the stated modeling choices.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Facts are sparse in the universe of plausible claims
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the minimum memory budget per key is characterized by the minimum KL-divergence between key and non-key output distributions... KL(µK∥µN)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
unique loss-minimizing query output... assign high confidence to all facts, while assigning either zero or fact-level confidence to non-facts
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Llms will always hallucinate, and we need to live with this,
URLhttps://arxiv.org/abs/2510.06265. Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URLhttps://openreview.net/forum?id= 5x788rqbcj. Zeyuan Allen-Zhu and Yuanzhi Li. Physic...
-
[2]
Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell
doi: 10.1145/3442188.3445922. URLhttps://doi.org/10.1145/3442188.3445922. Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors.Commun. ACM, 13(7): 422–426, 1970. doi: 10.1145/362686.362692. URLhttps://doi.org/10.1145/362686.362692. Avrim Blum and John Langford. Pac-mdl bounds. InLearning Theory and Kernel Machines: 16th Annual Confe...
-
[3]
URLhttps://arxiv.org/abs/2507.22915. Fredrik Hellstr"om, Giuseppe Durisi, Benjamin Guedj, and Maxim Raginsky. Generalization bounds: Perspectives from information theory and pac-bayes.Foundations and Trends in Machine Learning, 18(1):1–223, 2025. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. I...
-
[4]
doi: 10.1007/3-540-44676-1\_25. URLhttps://doi.org/10.1007/3-540-44676-1_25. Zhixuan Pan, Shaowen Wang, and Jian Li. Understanding LLM behaviors via compression: Data generation, knowledge acquisition and scaling laws.CoRR, abs/2504.09597, 2025. doi: 10.48550/ ARXIV.2504.09597. URLhttps://doi.org/10.48550/arXiv.2504.09597. 20 Michal Perelkiewicz and Rafal...
-
[5]
2.F p(µK, µN)is continuous inp
The function(p, µK, µN)7→F p(µK, µN)is jointly lower semi-continuous. 2.F p(µK, µN)is continuous inp. 3.F p is differentiable inpwith ∂ ∂p Fp(µK, µN) =− KL(µN ∥pµK+(1−p)µN) p2 wheneverp >0. Proof.We prove the three properties using the following identity: I(X; ˆX) = KL(P X, ˆX ∥PX ⊗P ˆX) =E X∼Bern(p) KL(P ˆX|X ∥P ˆX) =pKL(µ K∥µp) + (1−p)KL(µ N ∥µp),(2) wh...
-
[6]
Joint Lower Semi-Continuity.For any sequence( pn, µK,n, µN,n) → (p, µK, µN), we show that lim infF pn ≥F p. First, observe that the map( p, µK, µN) 7→µ p = pµK + (1 −p )µN is continuous from the product topology to the weak-* topology. Since KL divergence is jointly lower semi-continuous (LSC) with respect to the weak-* topology, it is immediate thatFp(µK...
-
[7]
We only need to verify continuity atp = 0
Continuity inp.For fixed µK, µN, and p > 0, Fp is clearly continuous as a composition of continuous functions. We only need to verify continuity atp = 0. Standard results in information theory have shown that the KL divergence vanishes sublinearly when the two distributions are close, see e.g. Proposition 2.20 in (Polyanskiy and Wu, 2025): ∂ ∂p KL(µN ∥µp)...
work page 2025
-
[8]
Differentiability in p.Fix µK, µN. For p∈ (0, 1), let ν := µK + µN and write fK := dµK dν , fN := dµN dν , so that the mixtureµp =pµ K + (1−p)µ N has density mp := dµp dν =pf K + (1−p)f N . Using (2), we can write I(p) :=I(X; ˆX) =p Z fK log fK mp dν+ (1−p) Z fN log fN mp dν. We first show thatI ′(p) = KL(µK∥µp) −KL (µN ∥µp). For convenience, we assume th...
work page 2011
-
[9]
Let ˆPX,Y (x′, y′|xu, yu)be the joint empirical distribution ofxi and yi for i = 1, . . . , uin the sequencesx u andy u. Then, ˆPX,Y (x′, y′|xu, yu)−P X(x′)PY|X (y′|x′) ≤γ
-
[10]
WheneverP Y|X (y′|x′) = 0, ˆPX,Y (x′, y′|xu, yu) = 0. It follows from uniform continuity that, for sufficiently smallγ, all P ∗ Y|X -typical sequences yu under condition xu will have the empirical distribution of(xi, yi)close to P ∗ X,Y, and thus satisfy(4) and (5) for the desired choice ofδ. By Lemma 2.13 in Csiszár and Körner (2011), there exist sequenc...
work page 2011
-
[11]
If the minimum is uniquely atx = 0. Then µ∗ N = δ0. This implies C∗ = 0 λK = 0(since λK >1). This leads toJ(µ ∗ N) =∞, which is not optimal
-
[12]
If the minimum is uniquely atx∗. Then µ∗ N = δx∗. This implies µ∗ K = µ∗ N and KL = 0, a contradiction. We conclude thatµ∗ N is a two-point distribution: µ∗ N = (1−q ∗)δ0 +q ∗δx∗. We can now determineµ∗ K using the relative density in(7): because dµ∗ K dµ∗ N (0) = 0λK = 0, it follows that it is a point massµ∗ K =δ x∗. Now we use the tight constraints to d...
-
[13]
Constraint onµ ∗ K:E µ∗ K[−lnX] =ε K =⇒ −ln(x ∗) =ε K =⇒x ∗ =e −εK
-
[14]
(1−q ∗)(−ln 1) +q ∗(−ln(1−x ∗)) =ε N =⇒q ∗ = εN −ln(1−x ∗)
Constraint onµ ∗ N:E µ∗ N [−ln(1−X)] =ε N. (1−q ∗)(−ln 1) +q ∗(−ln(1−x ∗)) =ε N =⇒q ∗ = εN −ln(1−x ∗) . The conditione −εK +e −εN >1ensures thatε N <−ln(1−e −εK), so0< q ∗ <1. Step 4: verifying KKT conditions.Note that we calculate C∗ = Eµ∗ N [X λK] = q∗(x∗)λK (since λK >1). Condition 1:g(x ∗) = 0. g(x∗) =− (x∗)λK C∗ −λ N ln(1−x ∗) =− 1 q∗ −λ N ln(1−x ∗) ...
-
[15]
(Constraints preserved)E ¯µK[1− ˆX] =E µK[1− ˆX]andE ¯µN [ ˆX] =E µN [ ˆX]
-
[16]
(Objective non-increasing) For everyp∈(0,1), Fp(¯µK,¯µN)≤F p(µK, µN). Consequently, in the definition ofRp(εK, εN)we may restrict toˆx∈ {0,1}, and the optimal output distributions are Bernoulli. Proof sketch.The first part follows from linearity: underT, the binary outputY∈ { 0, 1} satisfies E[Y| ˆX = t] = t, hence E[Y ] = E[ ˆX]and E[1 −Y ] = E[1 − ˆX]. ...
work page 2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.