pith. sign in

arxiv: 2602.00906 · v6 · submitted 2026-01-31 · 💻 cs.LG · cs.AI· cs.CL· cs.DS· cs.IT· math.IT

Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

Pith reviewed 2026-05-16 08:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLcs.DScs.ITmath.IT
keywords hallucinationrate-distortionmembership testingBloom filterslog-losslarge language modelsinformation theorysparse facts
0
0 comments X

The pith

Even with perfect data and training, limited capacity forces LLMs to assign high confidence to some non-facts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models the memorization of sparse facts as a membership testing problem that unifies Bloom filter error rates with the log-loss used to train language models. It derives a rate-distortion theorem showing that the lowest achievable distortion under a capacity constraint equals the minimum KL divergence between score distributions on facts versus non-facts. In the sparse regime this minimum is strictly positive, so the optimal strategy necessarily includes confident errors on some non-facts. Those errors are precisely what appear as hallucinations. Synthetic experiments confirm that the predicted errors persist even when training is optimal and the world is closed.

Core claim

The information-theoretically optimal strategy for membership testing under limited capacity, when facts are sparse, is characterized by the minimum KL divergence between the score distributions on facts and non-facts; this value is positive, so the optimal policy assigns high to some non-facts, which manifests as hallucination.

What carries the argument

The rate-distortion function for membership testing, defined as the minimum KL divergence between score distributions on facts and non-facts under a capacity (rate) constraint.

If this is right

  • Hallucinations remain even with perfect training data and optimal optimization in a closed world.
  • Abstaining or responding with low is not always capacity-optimal compared with confident errors on non-facts.
  • The same rate-distortion tradeoff unifies discrete set-membership errors with continuous probabilistic losses.
  • Increasing capacity reduces but does not eliminate the forced error rate unless the fact density changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Retrieval-augmented systems could lower the effective sparsity and thereby reduce the predicted hallucination rate.
  • The same tradeoff may appear in any learning system that compresses a sparse set of positive examples.
  • Measuring the empirical KL divergence between fact and non-fact scores on real models could test the predicted lower bound.

Load-bearing premise

Facts are sparse relative to the space of all plausible claims, so that perfect separation is not the most efficient use of limited memory.

What would settle it

A concrete strategy or trained model that achieves lower average log-loss than the predicted minimum KL divergence at the same capacity, or that reaches zero distortion on non-facts without using more bits.

Figures

Figures reproduced from arXiv: 2602.00906 by Anxin Guo, Jingwei Li.

Figure 1
Figure 1. Figure 1: Output distributions on facts vs. non-facts ( [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Effect of different weight λF on fact. Each color represents a different model size. The amount of information per key decreases with λF for each model size. et al., 2024). Theorem 4.3 is agnostic to how xˆ is produced. Whether the decision is derived from a calibrated probability, a raw logit, a log-likelihood ratio, or a multi-stage verification chain that outputs a scalar score, the final thresholded ma… view at source ↗
Figure 3
Figure 3. Figure 3: Output distributions with 15145 facts and 8767 parameters. [PITH_FULL_IMAGE:figures/full_fig_p044_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Output distributions with 15145 facts and 33085 parameters. [PITH_FULL_IMAGE:figures/full_fig_p044_4.png] view at source ↗
read the original abstract

Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination: even with optimal training, perfect data, and a simplified "closed world" setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on synthetic data, showing that hallucinations persist as a natural consequence of lossy compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that hallucinations in LLMs arise as an inevitable consequence of space-optimal memorization of sparse facts. It formalizes the problem as membership testing, unifies Bloom-filter false-positive rates with continuous log-loss via a rate-distortion objective, and proves that the capacity-constrained optimum is characterized by the minimum KL divergence between score distributions on facts versus non-facts. This optimum forces high-confidence assignment to some non-facts rather than abstention. The claim is supported by a synthetic-data validation showing persistent hallucinations under optimal compression in a closed-world sparse regime.

Significance. If the central theorem is correct, the work supplies a clean information-theoretic explanation for why hallucinations survive even perfect data and optimal training, shifting focus from training artifacts to fundamental capacity limits. The unification of discrete set-membership distortion with log-loss is technically interesting and could inform capacity-aware evaluation protocols.

major comments (3)
  1. [Section 3] The rate-distortion theorem (Section 3): the manuscript states that the optimal memory is characterized by min KL between fact and non-fact score distributions, yet supplies neither the derivation of the rate-distortion function nor the explicit optimization that shows why the minimizer must place high mass on non-facts. Without these steps it is impossible to confirm that the claimed behavior follows from the stated objective rather than from the modeling assumptions.
  2. [Section 2] Problem formulation (Section 2): the result that abstention is suboptimal holds only under the joint assumptions of discrete sparse claims, symmetric log-loss distortion, and absence of inferable structure. No sensitivity analysis or alternative distortion is examined; relaxing any of these (as noted in the skeptic analysis) allows low-confidence or abstention solutions to achieve the same rate, undermining the generality of the hallucination explanation.
  3. [Section 5] Empirical validation (Section 5): the synthetic experiments are reported to confirm the theory inside the assumed regime, but the text provides neither the precise experimental protocol (number of trials, exact capacity values, statistical controls) nor quantitative baselines that would demonstrate the hallucinations are produced by the rate-distortion optimum rather than by implementation details.
minor comments (2)
  1. [Section 2] Notation for the unified distortion measure is introduced without an explicit equation linking the Bloom-filter error probability to the log-loss term; adding this would clarify the unification.
  2. The abstract and introduction cite the closed-world assumption but do not reference prior rate-distortion analyses of memorization or Bloom-filter literature in learning theory.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate the requested clarifications, derivations, and experimental details while preserving the core claims.

read point-by-point responses
  1. Referee: [Section 3] The rate-distortion theorem (Section 3): the manuscript states that the optimal memory is characterized by min KL between fact and non-fact score distributions, yet supplies neither the derivation of the rate-distortion function nor the explicit optimization that shows why the minimizer must place high mass on non-facts. Without these steps it is impossible to confirm that the claimed behavior follows from the stated objective rather than from the modeling assumptions.

    Authors: We agree that the derivation steps require more explicit presentation. In the revised version we will add a dedicated subsection deriving the rate-distortion function from the mutual-information rate constraint and the expected symmetric log-loss distortion. The optimization will be shown to yield a minimizer that places positive mass on non-facts because any abstention strategy increases the required rate to keep distortion below the target threshold; the minimum KL characterization follows directly from the Lagrange dual of the constrained problem. revision: yes

  2. Referee: [Section 2] Problem formulation (Section 2): the result that abstention is suboptimal holds only under the joint assumptions of discrete sparse claims, symmetric log-loss distortion, and absence of inferable structure. No sensitivity analysis or alternative distortion is examined; relaxing any of these (as noted in the skeptic analysis) allows low-confidence or abstention solutions to achieve the same rate, undermining the generality of the hallucination explanation.

    Authors: The assumptions are stated explicitly in Section 2 as the modeling choices that isolate the sparse, unstructured regime relevant to random-fact hallucinations. We will revise the text to include a short sensitivity paragraph clarifying that the sub-optimality of abstention is specific to this regime and that structured or inferable facts fall outside the theorem’s scope. While a comprehensive sensitivity study lies beyond the present scope, we will note that alternative distortions permitting abstention correspond to different problem settings with higher effective capacity requirements. revision: partial

  3. Referee: [Section 5] Empirical validation (Section 5): the synthetic experiments are reported to confirm the theory inside the assumed regime, but the text provides neither the precise experimental protocol (number of trials, exact capacity values, statistical controls) nor quantitative baselines that would demonstrate the hallucinations are produced by the rate-distortion optimum rather than by implementation details.

    Authors: We will expand Section 5 with the missing protocol details: 100 independent trials, capacity values of 0.1–0.5 bits per fact, bootstrap confidence intervals, and two quantitative baselines (uniform random scoring and a simple threshold classifier). These additions will show that the observed hallucination rates match the theoretical minimum-KL predictions rather than arising from implementation artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: rate-distortion optimum follows directly from stated distortion and sparsity assumptions

full rationale

The derivation begins with an explicit rate-distortion formulation for membership testing that unifies Bloom-filter error with log-loss. The optimal strategy is characterized as the minimum KL divergence between score distributions on facts versus non-facts; under the paper's stated regime of sparse facts in a closed discrete universe, this minimum is achieved by assigning high confidence to some non-facts. This is a mathematical consequence of the chosen distortion measure and capacity constraint, not a redefinition or fit. No self-citations, fitted parameters, or ansatzes are invoked to close the argument, and the synthetic experiments simply instantiate the same objective. The result is therefore self-contained against the stated modeling choices.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that facts are sparse and on the modeling decision to treat discrete set-membership error and continuous log-loss as instances of the same rate-distortion problem.

axioms (1)
  • domain assumption Facts are sparse in the universe of plausible claims
    Invoked to define the operating regime in which the rate-distortion characterization is derived.

pith-pipeline@v0.9.0 · 5473 in / 1272 out tokens · 35853 ms · 2026-05-16T08:33:35.435664+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

  1. [1]

    Llms will always hallucinate, and we need to live with this,

    URLhttps://arxiv.org/abs/2510.06265. Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URLhttps://openreview.net/forum?id= 5x788rqbcj. Zeyuan Allen-Zhu and Yuanzhi Li. Physic...

  2. [2]

    Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

    doi: 10.1145/3442188.3445922. URLhttps://doi.org/10.1145/3442188.3445922. Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors.Commun. ACM, 13(7): 422–426, 1970. doi: 10.1145/362686.362692. URLhttps://doi.org/10.1145/362686.362692. Avrim Blum and John Langford. Pac-mdl bounds. InLearning Theory and Kernel Machines: 16th Annual Confe...

  3. [3]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

    URLhttps://arxiv.org/abs/2507.22915. Fredrik Hellstr"om, Giuseppe Durisi, Benjamin Guedj, and Maxim Raginsky. Generalization bounds: Perspectives from information theory and pac-bayes.Foundations and Trends in Machine Learning, 18(1):1–223, 2025. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. I...

  4. [4]

    world state

    doi: 10.1007/3-540-44676-1\_25. URLhttps://doi.org/10.1007/3-540-44676-1_25. Zhixuan Pan, Shaowen Wang, and Jian Li. Understanding LLM behaviors via compression: Data generation, knowledge acquisition and scaling laws.CoRR, abs/2504.09597, 2025. doi: 10.48550/ ARXIV.2504.09597. URLhttps://doi.org/10.48550/arXiv.2504.09597. 20 Michal Perelkiewicz and Rafal...

  5. [5]

    2.F p(µK, µN)is continuous inp

    The function(p, µK, µN)7→F p(µK, µN)is jointly lower semi-continuous. 2.F p(µK, µN)is continuous inp. 3.F p is differentiable inpwith ∂ ∂p Fp(µK, µN) =− KL(µN ∥pµK+(1−p)µN) p2 wheneverp >0. Proof.We prove the three properties using the following identity: I(X; ˆX) = KL(P X, ˆX ∥PX ⊗P ˆX) =E X∼Bern(p) KL(P ˆX|X ∥P ˆX) =pKL(µ K∥µp) + (1−p)KL(µ N ∥µp),(2) wh...

  6. [6]

    First, observe that the map( p, µK, µN) 7→µ p = pµK + (1 −p )µN is continuous from the product topology to the weak-* topology

    Joint Lower Semi-Continuity.For any sequence( pn, µK,n, µN,n) → (p, µK, µN), we show that lim infF pn ≥F p. First, observe that the map( p, µK, µN) 7→µ p = pµK + (1 −p )µN is continuous from the product topology to the weak-* topology. Since KL divergence is jointly lower semi-continuous (LSC) with respect to the weak-* topology, it is immediate thatFp(µK...

  7. [7]

    We only need to verify continuity atp = 0

    Continuity inp.For fixed µK, µN, and p > 0, Fp is clearly continuous as a composition of continuous functions. We only need to verify continuity atp = 0. Standard results in information theory have shown that the KL divergence vanishes sublinearly when the two distributions are close, see e.g. Proposition 2.20 in (Polyanskiy and Wu, 2025): ∂ ∂p KL(µN ∥µp)...

  8. [8]

    For p∈ (0, 1), let ν := µK + µN and write fK := dµK dν , fN := dµN dν , so that the mixtureµp =pµ K + (1−p)µ N has density mp := dµp dν =pf K + (1−p)f N

    Differentiability in p.Fix µK, µN. For p∈ (0, 1), let ν := µK + µN and write fK := dµK dν , fN := dµN dν , so that the mixtureµp =pµ K + (1−p)µ N has density mp := dµp dν =pf K + (1−p)f N . Using (2), we can write I(p) :=I(X; ˆX) =p Z fK log fK mp dν+ (1−p) Z fN log fN mp dν. We first show thatI ′(p) = KL(µK∥µp) −KL (µN ∥µp). For convenience, we assume th...

  9. [9]

    , uin the sequencesx u andy u

    Let ˆPX,Y (x′, y′|xu, yu)be the joint empirical distribution ofxi and yi for i = 1, . . . , uin the sequencesx u andy u. Then, ˆPX,Y (x′, y′|xu, yu)−P X(x′)PY|X (y′|x′) ≤γ

  10. [10]

    WheneverP Y|X (y′|x′) = 0, ˆPX,Y (x′, y′|xu, yu) = 0. It follows from uniform continuity that, for sufficiently smallγ, all P ∗ Y|X -typical sequences yu under condition xu will have the empirical distribution of(xi, yi)close to P ∗ X,Y, and thus satisfy(4) and (5) for the desired choice ofδ. By Lemma 2.13 in Csiszár and Körner (2011), there exist sequenc...

  11. [11]

    Then µ∗ N = δ0

    If the minimum is uniquely atx = 0. Then µ∗ N = δ0. This implies C∗ = 0 λK = 0(since λK >1). This leads toJ(µ ∗ N) =∞, which is not optimal

  12. [12]

    Then µ∗ N = δx∗

    If the minimum is uniquely atx∗. Then µ∗ N = δx∗. This implies µ∗ K = µ∗ N and KL = 0, a contradiction. We conclude thatµ∗ N is a two-point distribution: µ∗ N = (1−q ∗)δ0 +q ∗δx∗. We can now determineµ∗ K using the relative density in(7): because dµ∗ K dµ∗ N (0) = 0λK = 0, it follows that it is a point massµ∗ K =δ x∗. Now we use the tight constraints to d...

  13. [13]

    Constraint onµ ∗ K:E µ∗ K[−lnX] =ε K =⇒ −ln(x ∗) =ε K =⇒x ∗ =e −εK

  14. [14]

    (1−q ∗)(−ln 1) +q ∗(−ln(1−x ∗)) =ε N =⇒q ∗ = εN −ln(1−x ∗)

    Constraint onµ ∗ N:E µ∗ N [−ln(1−X)] =ε N. (1−q ∗)(−ln 1) +q ∗(−ln(1−x ∗)) =ε N =⇒q ∗ = εN −ln(1−x ∗) . The conditione −εK +e −εN >1ensures thatε N <−ln(1−e −εK), so0< q ∗ <1. Step 4: verifying KKT conditions.Note that we calculate C∗ = Eµ∗ N [X λK] = q∗(x∗)λK (since λK >1). Condition 1:g(x ∗) = 0. g(x∗) =− (x∗)λK C∗ −λ N ln(1−x ∗) =− 1 q∗ −λ N ln(1−x ∗) ...

  15. [15]

    (Constraints preserved)E ¯µK[1− ˆX] =E µK[1− ˆX]andE ¯µN [ ˆX] =E µN [ ˆX]

  16. [16]

    Consequently, in the definition ofRp(εK, εN)we may restrict toˆx∈ {0,1}, and the optimal output distributions are Bernoulli

    (Objective non-increasing) For everyp∈(0,1), Fp(¯µK,¯µN)≤F p(µK, µN). Consequently, in the definition ofRp(εK, εN)we may restrict toˆx∈ {0,1}, and the optimal output distributions are Bernoulli. Proof sketch.The first part follows from linearity: underT, the binary outputY∈ { 0, 1} satisfies E[Y| ˆX = t] = t, hence E[Y ] = E[ ˆX]and E[1 −Y ] = E[1 − ˆX]. ...