A Sufficient-Statistic Reduction of the Information Bottleneck to a Low-Dimensional Problem

Joss Armstrong

arxiv: 2604.26744 · v1 · submitted 2026-04-29 · 💻 cs.IT · math.IT· stat.ML

A Sufficient-Statistic Reduction of the Information Bottleneck to a Low-Dimensional Problem

Joss Armstrong This is my paper

Pith reviewed 2026-05-07 12:39 UTC · model grok-4.3

classification 💻 cs.IT math.ITstat.ML

keywords information bottlenecksufficient statisticdimensionality reductionGaussian information bottleneckinformation theoryrate-distortion

0 comments

The pith

If the target conditional factors through a sufficient statistic, the Information Bottleneck problem on the full source is exactly equivalent to the problem on the reduced statistic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that whenever p(C|T) equals p(C|φ(T)) for some function φ, every optimal representation and every point on the IB tradeoff curve for the pair (T,C) is recovered exactly by solving the same IB problem on the pair (φ(T),C). The equivalence is lossless: the minimal mutual information values match at every trade-off parameter, and the optimal mappings differ only by composition with φ. As a direct result, the cost of computing the IB curve is governed by the dimension of φ rather than the ambient dimension of T. The classical closed-form Gaussian IB solution emerges immediately as the special case in which φ is the identity on the mean and covariance, and a nonlinear-Gaussian extension follows the same route.

Core claim

If p(C|T) factors through a sufficient statistic φ(T), then the Information Bottleneck functional I(T;R) − η I(R;C) attains the same minimum value for every η when R is optimized over functions of T or over functions of φ(T), and the optimal representations are related by pullback through φ; consequently the entire IB curve is identical for the original and reduced problems.

What carries the argument

The sufficient statistic φ(T) for the target C, defined by the factorization p(C|T) = p(C|φ(T)), which collapses the source while preserving all information relevant to the IB objective.

If this is right

The complexity of solving the IB problem scales with the dimension of the sufficient statistic rather than the dimension of the original source.
The known closed-form Gaussian IB solution is recovered as the special case in which φ extracts the sufficient statistics of the Gaussian conditional.
The same reduction immediately yields a nonlinear-Gaussian generalization of the IB solution.
When a low-dimensional sufficient statistic is known, the exact IB curve can be computed by solving the reduced problem at cost set by that statistic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reduction supplies an exact structural condition that separates tractable from intractable IB instances.
The same factorization argument may apply to other information-theoretic objectives that depend on the source only through its conditional on a target.
In settings where approximate sufficient statistics can be learned, the reduction offers a route to approximate but still exact-on-the-curve IB solutions.

Load-bearing premise

There exists a function φ such that the conditional distribution of the target given the source depends on the source only through φ.

What would settle it

An explicit pair of distributions where p(C|T) = p(C|φ(T)) yet the minimal IB Lagrangian value for (T,C) differs from the value for (φ(T),C) at some η.

read the original abstract

We show that if the conditional distribution p(C | T) factors through a sufficient statistic {\phi}(T), then the Information Bottleneck (IB) problem for (T, C) is exactly equivalent to the IB problem for ({\phi}(T), C). The reduction is loss-free: it preserves the full IB curve, the Lagrangian optimum at every trade-off parameter \b{eta}, and the optimal representations up to pullback through {\phi}. As a result, the computational complexity of solving the IB problem is governed by the dimension of the sufficient statistic rather than the ambient dimension of the source. This identifies an exact structural condition under which the generic IB problem becomes tractable, and gives a formal bridge between the discrete and linear-Gaussian regimes. We then show that the classical Gaussian IB solution of Chechik, Globerson, Tishby and Weiss is an immediate corollary of this reduction, and we state a nonlinear-Gaussian generalisation. A small numerical example illustrates the practical consequence: when a low-dimensional sufficient statistic is available, the exact IB curve can be computed on the reduced problem at a cost determined by the statistic rather than by the ambient source dimension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows that the IB problem reduces exactly to a low-dimensional version on any sufficient statistic for the target, with the Gaussian solution as a direct corollary.

read the letter

The central result is that if φ(T) is a sufficient statistic for C given T, then the full IB optimization over representations of T is equivalent to the same problem over representations of φ(T). The IB curve, every Lagrangian optimum, and the optimal mappings are preserved exactly, with no loss. This follows directly from the Markov chain C ⊥ T | φ(T) and the chain-rule decomposition of mutual information, which shows that any representation depending on T can be replaced by one depending only on φ(T) without increasing the objective. The paper treats the classical Gaussian IB as an immediate consequence rather than a separate derivation, and it states a nonlinear-Gaussian extension. A small numerical example illustrates the computational saving when the statistic is low-dimensional. The argument is clean and uses only standard properties of mutual information and sufficiency. The main limitation is that the reduction applies only when a low-dimensional sufficient statistic is already known or easy to identify; the paper does not address how to find one in general. The example is too small to demonstrate scaling behavior on larger problems. Overall the math holds up without circularity or hidden assumptions. This is useful for anyone working on IB in settings with known structure, such as certain representation-learning or compression tasks. It is worth sending to peer review because the claim is precise, the derivation is short and verifiable, and the reduction could make some previously intractable IB instances practical.

Referee Report

0 major / 2 minor

Summary. The manuscript shows that if p(C|T) factors through a sufficient statistic φ(T), then the Information Bottleneck problem for the pair (T,C) is exactly equivalent to the IB problem for (φ(T),C). The equivalence is loss-free: it preserves the full IB curve, the Lagrangian optimum for every trade-off parameter η, and the optimal representations up to pullback through φ. The authors recover the classical linear-Gaussian IB solution of Chechik et al. as an immediate corollary, state a nonlinear-Gaussian generalization, and illustrate the computational reduction with a small numerical example.

Significance. If the central reduction holds, the result is significant: it supplies an exact structural condition under which the generally intractable IB problem reduces to a problem whose complexity is governed by the dimension of the sufficient statistic rather than the ambient dimension of T. This provides a formal bridge between discrete and linear-Gaussian regimes and explains why the Gaussian IB admits a closed-form solution. The loss-free preservation of the entire curve and all optima strengthens the practical utility of the reduction for high-dimensional sources.

minor comments (2)

Abstract: the claim that the reduction 'identifies an exact structural condition under which the generic IB problem becomes tractable' should be qualified by noting that tractability remains governed by the cardinality or dimension of φ(T) itself.
The nonlinear-Gaussian generalisation is announced but its precise statement and derivation are not visible in the provided abstract; a short proof sketch or explicit statement of the resulting objective would aid verification.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our manuscript, the assessment of its significance, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core reduction follows directly from the standard definition of a sufficient statistic (p(C|T) = p(C|φ(T))) and the chain-rule decomposition of mutual information I(T;Z) = I(φ(T);Z) + I(T;Z|φ(T)) ≥ I(φ(T);Z), both of which are external mathematical facts independent of the paper. The equivalence of the full IB curve, Lagrangian optima, and optimal representations is a straightforward consequence of the induced Markov chain C ⊥ T | φ(T) and does not involve any fitted parameters, self-referential predictions, or load-bearing self-citations. The Gaussian IB result is recovered as a corollary from prior independent work (Chechik et al.), and the nonlinear-Gaussian extension is stated without circularity. No step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the standard definition of sufficient statistics and the Lagrangian formulation of IB, both taken from prior information-theory literature; no free parameters, new entities, or ad-hoc assumptions are introduced in the abstract.

axioms (2)

standard math Definition of sufficient statistic: φ(T) is sufficient for C if p(C|T) = p(C|φ(T))
Invoked directly in the statement that p(C|T) factors through φ(T)
domain assumption IB objective is the Lagrangian I(T;Z) - η I(Z;C) minimized over representations Z
The equivalence is claimed to hold for the full curve and every η

pith-pipeline@v0.9.0 · 5503 in / 1515 out tokens · 106634 ms · 2026-05-07T12:39:02.711234+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1]

Emergence of invariance and disentanglement in deep representations.Journal of Machine Learning Research, 19(50):1–34, 2018

Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations.Journal of Machine Learning Research, 19(50):1–34, 2018

work page 2018
[2]

Alemi, Ian Fischer, Joshua V

Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. InInternational Conference on Learning Representations, Toulon, France, 2017. Poster

work page 2017
[3]

Alemi, Ben Poole, Ian Fischer, Joshua V

Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 159–168, 2018

work page 2018
[4]

Richard E. Blahut. Computation of channel capacity and rate-distortion functions.IEEE Transactions on Information Theory, 18(4):460–473, 1972

work page 1972
[5]

Information bottleneck for gaussian variables.Journal of Machine Learning Research, 6(6):165–188, 2005

Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for gaussian variables.Journal of Machine Learning Research, 6(6):165–188, 2005. 8

work page 2005
[6]

Cover and Joy A

Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. Wiley-Interscience, Hoboken, NJ, 2 edition, 2006

work page 2006
[7]

Springer, New York, 2 edition, 2002

Olav Kallenberg.Foundations of Modern Probability. Springer, New York, 2 edition, 2002

work page 2002
[8]

Gaussian lower bound for the information bottleneck limit.Journal of Machine Learning Research, 18(213):1–29, 2018

Amichai Painsky and Naftali Tishby. Gaussian lower bound for the information bottleneck limit.Journal of Machine Learning Research, 18(213):1–29, 2018

work page 2018
[9]

Information- based clustering.Proceedings of the National Academy of Sciences of the United States of America, 102(51):18297–18302, 2005

Noam Slonim, Gurinder Singh Atwal, Gaˇ sper Tkaˇ cik, and William Bialek. Information- based clustering.Proceedings of the National Academy of Sciences of the United States of America, 102(51):18297–18302, 2005

work page 2005
[10]

Naftali Tishby, Fernando C. N. Pereira, and William Bialek. The information bottleneck method. InProceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, Monticello, IL, 1999. A Notation summary Symbol Meaning T,C,Xsource, relevance variable, representation φsufficient statistic mapφ:T → Z IT,C(R) IB info...

work page 1999

[1] [1]

Emergence of invariance and disentanglement in deep representations.Journal of Machine Learning Research, 19(50):1–34, 2018

Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations.Journal of Machine Learning Research, 19(50):1–34, 2018

work page 2018

[2] [2]

Alemi, Ian Fischer, Joshua V

Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. InInternational Conference on Learning Representations, Toulon, France, 2017. Poster

work page 2017

[3] [3]

Alemi, Ben Poole, Ian Fischer, Joshua V

Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 159–168, 2018

work page 2018

[4] [4]

Richard E. Blahut. Computation of channel capacity and rate-distortion functions.IEEE Transactions on Information Theory, 18(4):460–473, 1972

work page 1972

[5] [5]

Information bottleneck for gaussian variables.Journal of Machine Learning Research, 6(6):165–188, 2005

Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for gaussian variables.Journal of Machine Learning Research, 6(6):165–188, 2005. 8

work page 2005

[6] [6]

Cover and Joy A

Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. Wiley-Interscience, Hoboken, NJ, 2 edition, 2006

work page 2006

[7] [7]

Springer, New York, 2 edition, 2002

Olav Kallenberg.Foundations of Modern Probability. Springer, New York, 2 edition, 2002

work page 2002

[8] [8]

Gaussian lower bound for the information bottleneck limit.Journal of Machine Learning Research, 18(213):1–29, 2018

Amichai Painsky and Naftali Tishby. Gaussian lower bound for the information bottleneck limit.Journal of Machine Learning Research, 18(213):1–29, 2018

work page 2018

[9] [9]

Information- based clustering.Proceedings of the National Academy of Sciences of the United States of America, 102(51):18297–18302, 2005

Noam Slonim, Gurinder Singh Atwal, Gaˇ sper Tkaˇ cik, and William Bialek. Information- based clustering.Proceedings of the National Academy of Sciences of the United States of America, 102(51):18297–18302, 2005

work page 2005

[10] [10]

Naftali Tishby, Fernando C. N. Pereira, and William Bialek. The information bottleneck method. InProceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, Monticello, IL, 1999. A Notation summary Symbol Meaning T,C,Xsource, relevance variable, representation φsufficient statistic mapφ:T → Z IT,C(R) IB info...

work page 1999