Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

Anxin Guo; Jingwei Li

arxiv: 2602.00906 · v6 · submitted 2026-01-31 · 💻 cs.LG · cs.AI· cs.CL· cs.DS· cs.IT· math.IT

Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

Anxin Guo , Jingwei Li This is my paper

Pith reviewed 2026-05-16 08:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLcs.DScs.ITmath.IT

keywords hallucinationrate-distortionmembership testingBloom filterslog-losslarge language modelsinformation theorysparse facts

0 comments

The pith

Even with perfect data and training, limited capacity forces LLMs to assign high confidence to some non-facts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models the memorization of sparse facts as a membership testing problem that unifies Bloom filter error rates with the log-loss used to train language models. It derives a rate-distortion theorem showing that the lowest achievable distortion under a capacity constraint equals the minimum KL divergence between score distributions on facts versus non-facts. In the sparse regime this minimum is strictly positive, so the optimal strategy necessarily includes confident errors on some non-facts. Those errors are precisely what appear as hallucinations. Synthetic experiments confirm that the predicted errors persist even when training is optimal and the world is closed.

Core claim

The information-theoretically optimal strategy for membership testing under limited capacity, when facts are sparse, is characterized by the minimum KL divergence between the score distributions on facts and non-facts; this value is positive, so the optimal policy assigns high to some non-facts, which manifests as hallucination.

What carries the argument

The rate-distortion function for membership testing, defined as the minimum KL divergence between score distributions on facts and non-facts under a capacity (rate) constraint.

If this is right

Hallucinations remain even with perfect training data and optimal optimization in a closed world.
Abstaining or responding with low is not always capacity-optimal compared with confident errors on non-facts.
The same rate-distortion tradeoff unifies discrete set-membership errors with continuous probabilistic losses.
Increasing capacity reduces but does not eliminate the forced error rate unless the fact density changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Retrieval-augmented systems could lower the effective sparsity and thereby reduce the predicted hallucination rate.
The same tradeoff may appear in any learning system that compresses a sparse set of positive examples.
Measuring the empirical KL divergence between fact and non-fact scores on real models could test the predicted lower bound.

Load-bearing premise

Facts are sparse relative to the space of all plausible claims, so that perfect separation is not the most efficient use of limited memory.

What would settle it

A concrete strategy or trained model that achieves lower average log-loss than the predicted minimum KL divergence at the same capacity, or that reaches zero distortion on non-facts without using more bits.

Figures

Figures reproduced from arXiv: 2602.00906 by Anxin Guo, Jingwei Li.

**Figure 2.** Figure 2: Effect of different weight λF on fact. Each color represents a different model size. The amount of information per key decreases with λF for each model size. et al., 2024). Theorem 4.3 is agnostic to how xˆ is produced. Whether the decision is derived from a calibrated probability, a raw logit, a log-likelihood ratio, or a multi-stage verification chain that outputs a scalar score, the final thresholded ma… view at source ↗

**Figure 3.** Figure 3: Output distributions with 15145 facts and 8767 parameters. [PITH_FULL_IMAGE:figures/full_fig_p044_3.png] view at source ↗

**Figure 4.** Figure 4: Output distributions with 15145 facts and 33085 parameters. [PITH_FULL_IMAGE:figures/full_fig_p044_4.png] view at source ↗

read the original abstract

Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination: even with optimal training, perfect data, and a simplified "closed world" setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on synthetic data, showing that hallucinations persist as a natural consequence of lossy compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames hallucinations as the information-theoretically optimal outcome of lossy compression on sparse facts, but the modeling choices look too narrow to carry over to real LLMs.

read the letter

The central claim is that even perfect training on perfect data forces high-confidence errors on non-facts once memory is constrained, because the rate-distortion optimum for membership testing is a minimum KL between the score distributions on facts versus non-facts. That is the punchline, and it is new relative to the usual training-defect story. The unification of Bloom-filter false-positive rates with continuous log-loss under one objective is a neat move, and the synthetic checks confirm the math works inside the stated regime. Credit for making the argument explicit rather than hand-wavy. The derivation is not shown in the abstract, but the closed-form characterization itself is the contribution worth checking. The soft spots sit in the assumptions. The result requires facts to be sparse in a discrete closed universe with no inferable structure and with symmetric log-loss as the exact distortion measure. Relax any of those and low-confidence or abstention strategies can achieve the same rate without producing the claimed high-confidence hallucinations. Real language data has patterns, hierarchies, and open-ended generation, so the bridge from the synthetic case to LLMs is the part that needs the most scrutiny. Experiments stay inside the model, which is appropriate for a theorem but leaves the practical relevance open. This is for theorists working on memorization, generalization bounds, or information-theoretic views of scaling. A reader already thinking about capacity limits will find the framing useful even if they reject the extrapolation. It is coherent on its own terms and deserves a serious referee to pressure the assumptions and see whether the theorem can be stated more generally.

Referee Report

3 major / 2 minor

Summary. The paper claims that hallucinations in LLMs arise as an inevitable consequence of space-optimal memorization of sparse facts. It formalizes the problem as membership testing, unifies Bloom-filter false-positive rates with continuous log-loss via a rate-distortion objective, and proves that the capacity-constrained optimum is characterized by the minimum KL divergence between score distributions on facts versus non-facts. This optimum forces high-confidence assignment to some non-facts rather than abstention. The claim is supported by a synthetic-data validation showing persistent hallucinations under optimal compression in a closed-world sparse regime.

Significance. If the central theorem is correct, the work supplies a clean information-theoretic explanation for why hallucinations survive even perfect data and optimal training, shifting focus from training artifacts to fundamental capacity limits. The unification of discrete set-membership distortion with log-loss is technically interesting and could inform capacity-aware evaluation protocols.

major comments (3)

[Section 3] The rate-distortion theorem (Section 3): the manuscript states that the optimal memory is characterized by min KL between fact and non-fact score distributions, yet supplies neither the derivation of the rate-distortion function nor the explicit optimization that shows why the minimizer must place high mass on non-facts. Without these steps it is impossible to confirm that the claimed behavior follows from the stated objective rather than from the modeling assumptions.
[Section 2] Problem formulation (Section 2): the result that abstention is suboptimal holds only under the joint assumptions of discrete sparse claims, symmetric log-loss distortion, and absence of inferable structure. No sensitivity analysis or alternative distortion is examined; relaxing any of these (as noted in the skeptic analysis) allows low-confidence or abstention solutions to achieve the same rate, undermining the generality of the hallucination explanation.
[Section 5] Empirical validation (Section 5): the synthetic experiments are reported to confirm the theory inside the assumed regime, but the text provides neither the precise experimental protocol (number of trials, exact capacity values, statistical controls) nor quantitative baselines that would demonstrate the hallucinations are produced by the rate-distortion optimum rather than by implementation details.

minor comments (2)

[Section 2] Notation for the unified distortion measure is introduced without an explicit equation linking the Bloom-filter error probability to the log-loss term; adding this would clarify the unification.
The abstract and introduction cite the closed-world assumption but do not reference prior rate-distortion analyses of memorization or Bloom-filter literature in learning theory.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate the requested clarifications, derivations, and experimental details while preserving the core claims.

read point-by-point responses

Referee: [Section 3] The rate-distortion theorem (Section 3): the manuscript states that the optimal memory is characterized by min KL between fact and non-fact score distributions, yet supplies neither the derivation of the rate-distortion function nor the explicit optimization that shows why the minimizer must place high mass on non-facts. Without these steps it is impossible to confirm that the claimed behavior follows from the stated objective rather than from the modeling assumptions.

Authors: We agree that the derivation steps require more explicit presentation. In the revised version we will add a dedicated subsection deriving the rate-distortion function from the mutual-information rate constraint and the expected symmetric log-loss distortion. The optimization will be shown to yield a minimizer that places positive mass on non-facts because any abstention strategy increases the required rate to keep distortion below the target threshold; the minimum KL characterization follows directly from the Lagrange dual of the constrained problem. revision: yes
Referee: [Section 2] Problem formulation (Section 2): the result that abstention is suboptimal holds only under the joint assumptions of discrete sparse claims, symmetric log-loss distortion, and absence of inferable structure. No sensitivity analysis or alternative distortion is examined; relaxing any of these (as noted in the skeptic analysis) allows low-confidence or abstention solutions to achieve the same rate, undermining the generality of the hallucination explanation.

Authors: The assumptions are stated explicitly in Section 2 as the modeling choices that isolate the sparse, unstructured regime relevant to random-fact hallucinations. We will revise the text to include a short sensitivity paragraph clarifying that the sub-optimality of abstention is specific to this regime and that structured or inferable facts fall outside the theorem’s scope. While a comprehensive sensitivity study lies beyond the present scope, we will note that alternative distortions permitting abstention correspond to different problem settings with higher effective capacity requirements. revision: partial
Referee: [Section 5] Empirical validation (Section 5): the synthetic experiments are reported to confirm the theory inside the assumed regime, but the text provides neither the precise experimental protocol (number of trials, exact capacity values, statistical controls) nor quantitative baselines that would demonstrate the hallucinations are produced by the rate-distortion optimum rather than by implementation details.

Authors: We will expand Section 5 with the missing protocol details: 100 independent trials, capacity values of 0.1–0.5 bits per fact, bootstrap confidence intervals, and two quantitative baselines (uniform random scoring and a simple threshold classifier). These additions will show that the observed hallucination rates match the theoretical minimum-KL predictions rather than arising from implementation artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: rate-distortion optimum follows directly from stated distortion and sparsity assumptions

full rationale

The derivation begins with an explicit rate-distortion formulation for membership testing that unifies Bloom-filter error with log-loss. The optimal strategy is characterized as the minimum KL divergence between score distributions on facts versus non-facts; under the paper's stated regime of sparse facts in a closed discrete universe, this minimum is achieved by assigning high confidence to some non-facts. This is a mathematical consequence of the chosen distortion measure and capacity constraint, not a redefinition or fit. No self-citations, fitted parameters, or ansatzes are invoked to close the argument, and the synthetic experiments simply instantiate the same objective. The result is therefore self-contained against the stated modeling choices.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that facts are sparse and on the modeling decision to treat discrete set-membership error and continuous log-loss as instances of the same rate-distortion problem.

axioms (1)

domain assumption Facts are sparse in the universe of plausible claims
Invoked to define the operating regime in which the rate-distortion characterization is derived.

pith-pipeline@v0.9.0 · 5473 in / 1272 out tokens · 35853 ms · 2026-05-16T08:33:35.435664+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the minimum memory budget per key is characterized by the minimum KL-divergence between key and non-key output distributions... KL(µK∥µN)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

unique loss-minimizing query output... assign high confidence to all facts, while assigning either zero or fact-level confidence to non-facts

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Llms will always hallucinate, and we need to live with this,

URLhttps://arxiv.org/abs/2510.06265. Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URLhttps://openreview.net/forum?id= 5x788rqbcj. Zeyuan Allen-Zhu and Yuanzhi Li. Physic...

work page doi:10.48550/arxiv.2409.05746 2024
[2]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

doi: 10.1145/3442188.3445922. URLhttps://doi.org/10.1145/3442188.3445922. Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors.Commun. ACM, 13(7): 422–426, 1970. doi: 10.1145/362686.362692. URLhttps://doi.org/10.1145/362686.362692. Avrim Blum and John Langford. Pac-mdl bounds. InLearning Theory and Kernel Machines: 16th Annual Confe...

work page doi:10.1145/3442188.3445922 1970
[3]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

URLhttps://arxiv.org/abs/2507.22915. Fredrik Hellstr"om, Giuseppe Durisi, Benjamin Guedj, and Maxim Raginsky. Generalization bounds: Perspectives from information theory and pac-bayes.Foundations and Trends in Machine Learning, 18(1):1–223, 2025. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. I...

work page doi:10.1145/3703155 2025
[4]

world state

doi: 10.1007/3-540-44676-1\_25. URLhttps://doi.org/10.1007/3-540-44676-1_25. Zhixuan Pan, Shaowen Wang, and Jian Li. Understanding LLM behaviors via compression: Data generation, knowledge acquisition and scaling laws.CoRR, abs/2504.09597, 2025. doi: 10.48550/ ARXIV.2504.09597. URLhttps://doi.org/10.48550/arXiv.2504.09597. 20 Michal Perelkiewicz and Rafal...

work page doi:10.1007/3-540-44676-1 2025
[5]

2.F p(µK, µN)is continuous inp

The function(p, µK, µN)7→F p(µK, µN)is jointly lower semi-continuous. 2.F p(µK, µN)is continuous inp. 3.F p is differentiable inpwith ∂ ∂p Fp(µK, µN) =− KL(µN ∥pµK+(1−p)µN) p2 wheneverp >0. Proof.We prove the three properties using the following identity: I(X; ˆX) = KL(P X, ˆX ∥PX ⊗P ˆX) =E X∼Bern(p) KL(P ˆX|X ∥P ˆX) =pKL(µ K∥µp) + (1−p)KL(µ N ∥µp),(2) wh...

work page
[6]

First, observe that the map( p, µK, µN) 7→µ p = pµK + (1 −p )µN is continuous from the product topology to the weak-* topology

Joint Lower Semi-Continuity.For any sequence( pn, µK,n, µN,n) → (p, µK, µN), we show that lim infF pn ≥F p. First, observe that the map( p, µK, µN) 7→µ p = pµK + (1 −p )µN is continuous from the product topology to the weak-* topology. Since KL divergence is jointly lower semi-continuous (LSC) with respect to the weak-* topology, it is immediate thatFp(µK...

work page
[7]

We only need to verify continuity atp = 0

Continuity inp.For fixed µK, µN, and p > 0, Fp is clearly continuous as a composition of continuous functions. We only need to verify continuity atp = 0. Standard results in information theory have shown that the KL divergence vanishes sublinearly when the two distributions are close, see e.g. Proposition 2.20 in (Polyanskiy and Wu, 2025): ∂ ∂p KL(µN ∥µp)...

work page 2025
[8]

For p∈ (0, 1), let ν := µK + µN and write fK := dµK dν , fN := dµN dν , so that the mixtureµp =pµ K + (1−p)µ N has density mp := dµp dν =pf K + (1−p)f N

Differentiability in p.Fix µK, µN. For p∈ (0, 1), let ν := µK + µN and write fK := dµK dν , fN := dµN dν , so that the mixtureµp =pµ K + (1−p)µ N has density mp := dµp dν =pf K + (1−p)f N . Using (2), we can write I(p) :=I(X; ˆX) =p Z fK log fK mp dν+ (1−p) Z fN log fN mp dν. We first show thatI ′(p) = KL(µK∥µp) −KL (µN ∥µp). For convenience, we assume th...

work page 2011
[9]

, uin the sequencesx u andy u

Let ˆPX,Y (x′, y′|xu, yu)be the joint empirical distribution ofxi and yi for i = 1, . . . , uin the sequencesx u andy u. Then, ˆPX,Y (x′, y′|xu, yu)−P X(x′)PY|X (y′|x′) ≤γ

work page
[10]

WheneverP Y|X (y′|x′) = 0, ˆPX,Y (x′, y′|xu, yu) = 0. It follows from uniform continuity that, for sufficiently smallγ, all P ∗ Y|X -typical sequences yu under condition xu will have the empirical distribution of(xi, yi)close to P ∗ X,Y, and thus satisfy(4) and (5) for the desired choice ofδ. By Lemma 2.13 in Csiszár and Körner (2011), there exist sequenc...

work page 2011
[11]

Then µ∗ N = δ0

If the minimum is uniquely atx = 0. Then µ∗ N = δ0. This implies C∗ = 0 λK = 0(since λK >1). This leads toJ(µ ∗ N) =∞, which is not optimal

work page
[12]

Then µ∗ N = δx∗

If the minimum is uniquely atx∗. Then µ∗ N = δx∗. This implies µ∗ K = µ∗ N and KL = 0, a contradiction. We conclude thatµ∗ N is a two-point distribution: µ∗ N = (1−q ∗)δ0 +q ∗δx∗. We can now determineµ∗ K using the relative density in(7): because dµ∗ K dµ∗ N (0) = 0λK = 0, it follows that it is a point massµ∗ K =δ x∗. Now we use the tight constraints to d...

work page
[13]

Constraint onµ ∗ K:E µ∗ K[−lnX] =ε K =⇒ −ln(x ∗) =ε K =⇒x ∗ =e −εK

work page
[14]

(1−q ∗)(−ln 1) +q ∗(−ln(1−x ∗)) =ε N =⇒q ∗ = εN −ln(1−x ∗)

Constraint onµ ∗ N:E µ∗ N [−ln(1−X)] =ε N. (1−q ∗)(−ln 1) +q ∗(−ln(1−x ∗)) =ε N =⇒q ∗ = εN −ln(1−x ∗) . The conditione −εK +e −εN >1ensures thatε N <−ln(1−e −εK), so0< q ∗ <1. Step 4: verifying KKT conditions.Note that we calculate C∗ = Eµ∗ N [X λK] = q∗(x∗)λK (since λK >1). Condition 1:g(x ∗) = 0. g(x∗) =− (x∗)λK C∗ −λ N ln(1−x ∗) =− 1 q∗ −λ N ln(1−x ∗) ...

work page
[15]

(Constraints preserved)E ¯µK[1− ˆX] =E µK[1− ˆX]andE ¯µN [ ˆX] =E µN [ ˆX]

work page
[16]

Consequently, in the definition ofRp(εK, εN)we may restrict toˆx∈ {0,1}, and the optimal output distributions are Bernoulli

(Objective non-increasing) For everyp∈(0,1), Fp(¯µK,¯µN)≤F p(µK, µN). Consequently, in the definition ofRp(εK, εN)we may restrict toˆx∈ {0,1}, and the optimal output distributions are Bernoulli. Proof sketch.The first part follows from linearity: underT, the binary outputY∈ { 0, 1} satisfies E[Y| ˆX = t] = t, hence E[Y ] = E[ ˆX]and E[1 −Y ] = E[1 − ˆX]. ...

work page 2009

[1] [1]

Llms will always hallucinate, and we need to live with this,

URLhttps://arxiv.org/abs/2510.06265. Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URLhttps://openreview.net/forum?id= 5x788rqbcj. Zeyuan Allen-Zhu and Yuanzhi Li. Physic...

work page doi:10.48550/arxiv.2409.05746 2024

[2] [2]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

doi: 10.1145/3442188.3445922. URLhttps://doi.org/10.1145/3442188.3445922. Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors.Commun. ACM, 13(7): 422–426, 1970. doi: 10.1145/362686.362692. URLhttps://doi.org/10.1145/362686.362692. Avrim Blum and John Langford. Pac-mdl bounds. InLearning Theory and Kernel Machines: 16th Annual Confe...

work page doi:10.1145/3442188.3445922 1970

[3] [3]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

URLhttps://arxiv.org/abs/2507.22915. Fredrik Hellstr"om, Giuseppe Durisi, Benjamin Guedj, and Maxim Raginsky. Generalization bounds: Perspectives from information theory and pac-bayes.Foundations and Trends in Machine Learning, 18(1):1–223, 2025. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. I...

work page doi:10.1145/3703155 2025

[4] [4]

world state

doi: 10.1007/3-540-44676-1\_25. URLhttps://doi.org/10.1007/3-540-44676-1_25. Zhixuan Pan, Shaowen Wang, and Jian Li. Understanding LLM behaviors via compression: Data generation, knowledge acquisition and scaling laws.CoRR, abs/2504.09597, 2025. doi: 10.48550/ ARXIV.2504.09597. URLhttps://doi.org/10.48550/arXiv.2504.09597. 20 Michal Perelkiewicz and Rafal...

work page doi:10.1007/3-540-44676-1 2025

[5] [5]

2.F p(µK, µN)is continuous inp

The function(p, µK, µN)7→F p(µK, µN)is jointly lower semi-continuous. 2.F p(µK, µN)is continuous inp. 3.F p is differentiable inpwith ∂ ∂p Fp(µK, µN) =− KL(µN ∥pµK+(1−p)µN) p2 wheneverp >0. Proof.We prove the three properties using the following identity: I(X; ˆX) = KL(P X, ˆX ∥PX ⊗P ˆX) =E X∼Bern(p) KL(P ˆX|X ∥P ˆX) =pKL(µ K∥µp) + (1−p)KL(µ N ∥µp),(2) wh...

work page

[6] [6]

First, observe that the map( p, µK, µN) 7→µ p = pµK + (1 −p )µN is continuous from the product topology to the weak-* topology

Joint Lower Semi-Continuity.For any sequence( pn, µK,n, µN,n) → (p, µK, µN), we show that lim infF pn ≥F p. First, observe that the map( p, µK, µN) 7→µ p = pµK + (1 −p )µN is continuous from the product topology to the weak-* topology. Since KL divergence is jointly lower semi-continuous (LSC) with respect to the weak-* topology, it is immediate thatFp(µK...

work page

[7] [7]

We only need to verify continuity atp = 0

Continuity inp.For fixed µK, µN, and p > 0, Fp is clearly continuous as a composition of continuous functions. We only need to verify continuity atp = 0. Standard results in information theory have shown that the KL divergence vanishes sublinearly when the two distributions are close, see e.g. Proposition 2.20 in (Polyanskiy and Wu, 2025): ∂ ∂p KL(µN ∥µp)...

work page 2025

[8] [8]

For p∈ (0, 1), let ν := µK + µN and write fK := dµK dν , fN := dµN dν , so that the mixtureµp =pµ K + (1−p)µ N has density mp := dµp dν =pf K + (1−p)f N

Differentiability in p.Fix µK, µN. For p∈ (0, 1), let ν := µK + µN and write fK := dµK dν , fN := dµN dν , so that the mixtureµp =pµ K + (1−p)µ N has density mp := dµp dν =pf K + (1−p)f N . Using (2), we can write I(p) :=I(X; ˆX) =p Z fK log fK mp dν+ (1−p) Z fN log fN mp dν. We first show thatI ′(p) = KL(µK∥µp) −KL (µN ∥µp). For convenience, we assume th...

work page 2011

[9] [9]

, uin the sequencesx u andy u

Let ˆPX,Y (x′, y′|xu, yu)be the joint empirical distribution ofxi and yi for i = 1, . . . , uin the sequencesx u andy u. Then, ˆPX,Y (x′, y′|xu, yu)−P X(x′)PY|X (y′|x′) ≤γ

work page

[10] [10]

WheneverP Y|X (y′|x′) = 0, ˆPX,Y (x′, y′|xu, yu) = 0. It follows from uniform continuity that, for sufficiently smallγ, all P ∗ Y|X -typical sequences yu under condition xu will have the empirical distribution of(xi, yi)close to P ∗ X,Y, and thus satisfy(4) and (5) for the desired choice ofδ. By Lemma 2.13 in Csiszár and Körner (2011), there exist sequenc...

work page 2011

[11] [11]

Then µ∗ N = δ0

If the minimum is uniquely atx = 0. Then µ∗ N = δ0. This implies C∗ = 0 λK = 0(since λK >1). This leads toJ(µ ∗ N) =∞, which is not optimal

work page

[12] [12]

Then µ∗ N = δx∗

If the minimum is uniquely atx∗. Then µ∗ N = δx∗. This implies µ∗ K = µ∗ N and KL = 0, a contradiction. We conclude thatµ∗ N is a two-point distribution: µ∗ N = (1−q ∗)δ0 +q ∗δx∗. We can now determineµ∗ K using the relative density in(7): because dµ∗ K dµ∗ N (0) = 0λK = 0, it follows that it is a point massµ∗ K =δ x∗. Now we use the tight constraints to d...

work page

[13] [13]

Constraint onµ ∗ K:E µ∗ K[−lnX] =ε K =⇒ −ln(x ∗) =ε K =⇒x ∗ =e −εK

work page

[14] [14]

(1−q ∗)(−ln 1) +q ∗(−ln(1−x ∗)) =ε N =⇒q ∗ = εN −ln(1−x ∗)

Constraint onµ ∗ N:E µ∗ N [−ln(1−X)] =ε N. (1−q ∗)(−ln 1) +q ∗(−ln(1−x ∗)) =ε N =⇒q ∗ = εN −ln(1−x ∗) . The conditione −εK +e −εN >1ensures thatε N <−ln(1−e −εK), so0< q ∗ <1. Step 4: verifying KKT conditions.Note that we calculate C∗ = Eµ∗ N [X λK] = q∗(x∗)λK (since λK >1). Condition 1:g(x ∗) = 0. g(x∗) =− (x∗)λK C∗ −λ N ln(1−x ∗) =− 1 q∗ −λ N ln(1−x ∗) ...

work page

[15] [15]

(Constraints preserved)E ¯µK[1− ˆX] =E µK[1− ˆX]andE ¯µN [ ˆX] =E µN [ ˆX]

work page

[16] [16]

Consequently, in the definition ofRp(εK, εN)we may restrict toˆx∈ {0,1}, and the optimal output distributions are Bernoulli

(Objective non-increasing) For everyp∈(0,1), Fp(¯µK,¯µN)≤F p(µK, µN). Consequently, in the definition ofRp(εK, εN)we may restrict toˆx∈ {0,1}, and the optimal output distributions are Bernoulli. Proof sketch.The first part follows from linearity: underT, the binary outputY∈ { 0, 1} satisfies E[Y| ˆX = t] = t, hence E[Y ] = E[ ˆX]and E[1 −Y ] = E[1 − ˆX]. ...

work page 2009