pith. sign in

arxiv: 2604.26744 · v1 · submitted 2026-04-29 · 💻 cs.IT · math.IT· stat.ML

A Sufficient-Statistic Reduction of the Information Bottleneck to a Low-Dimensional Problem

Pith reviewed 2026-05-07 12:39 UTC · model grok-4.3

classification 💻 cs.IT math.ITstat.ML
keywords information bottlenecksufficient statisticdimensionality reductionGaussian information bottleneckinformation theoryrate-distortion
0
0 comments X

The pith

If the target conditional factors through a sufficient statistic, the Information Bottleneck problem on the full source is exactly equivalent to the problem on the reduced statistic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that whenever p(C|T) equals p(C|φ(T)) for some function φ, every optimal representation and every point on the IB tradeoff curve for the pair (T,C) is recovered exactly by solving the same IB problem on the pair (φ(T),C). The equivalence is lossless: the minimal mutual information values match at every trade-off parameter, and the optimal mappings differ only by composition with φ. As a direct result, the cost of computing the IB curve is governed by the dimension of φ rather than the ambient dimension of T. The classical closed-form Gaussian IB solution emerges immediately as the special case in which φ is the identity on the mean and covariance, and a nonlinear-Gaussian extension follows the same route.

Core claim

If p(C|T) factors through a sufficient statistic φ(T), then the Information Bottleneck functional I(T;R) − η I(R;C) attains the same minimum value for every η when R is optimized over functions of T or over functions of φ(T), and the optimal representations are related by pullback through φ; consequently the entire IB curve is identical for the original and reduced problems.

What carries the argument

The sufficient statistic φ(T) for the target C, defined by the factorization p(C|T) = p(C|φ(T)), which collapses the source while preserving all information relevant to the IB objective.

If this is right

  • The complexity of solving the IB problem scales with the dimension of the sufficient statistic rather than the dimension of the original source.
  • The known closed-form Gaussian IB solution is recovered as the special case in which φ extracts the sufficient statistics of the Gaussian conditional.
  • The same reduction immediately yields a nonlinear-Gaussian generalization of the IB solution.
  • When a low-dimensional sufficient statistic is known, the exact IB curve can be computed by solving the reduced problem at cost set by that statistic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reduction supplies an exact structural condition that separates tractable from intractable IB instances.
  • The same factorization argument may apply to other information-theoretic objectives that depend on the source only through its conditional on a target.
  • In settings where approximate sufficient statistics can be learned, the reduction offers a route to approximate but still exact-on-the-curve IB solutions.

Load-bearing premise

There exists a function φ such that the conditional distribution of the target given the source depends on the source only through φ.

What would settle it

An explicit pair of distributions where p(C|T) = p(C|φ(T)) yet the minimal IB Lagrangian value for (T,C) differs from the value for (φ(T),C) at some η.

read the original abstract

We show that if the conditional distribution p(C | T) factors through a sufficient statistic {\phi}(T), then the Information Bottleneck (IB) problem for (T, C) is exactly equivalent to the IB problem for ({\phi}(T), C). The reduction is loss-free: it preserves the full IB curve, the Lagrangian optimum at every trade-off parameter \b{eta}, and the optimal representations up to pullback through {\phi}. As a result, the computational complexity of solving the IB problem is governed by the dimension of the sufficient statistic rather than the ambient dimension of the source. This identifies an exact structural condition under which the generic IB problem becomes tractable, and gives a formal bridge between the discrete and linear-Gaussian regimes. We then show that the classical Gaussian IB solution of Chechik, Globerson, Tishby and Weiss is an immediate corollary of this reduction, and we state a nonlinear-Gaussian generalisation. A small numerical example illustrates the practical consequence: when a low-dimensional sufficient statistic is available, the exact IB curve can be computed on the reduced problem at a cost determined by the statistic rather than by the ambient source dimension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript shows that if p(C|T) factors through a sufficient statistic φ(T), then the Information Bottleneck problem for the pair (T,C) is exactly equivalent to the IB problem for (φ(T),C). The equivalence is loss-free: it preserves the full IB curve, the Lagrangian optimum for every trade-off parameter η, and the optimal representations up to pullback through φ. The authors recover the classical linear-Gaussian IB solution of Chechik et al. as an immediate corollary, state a nonlinear-Gaussian generalization, and illustrate the computational reduction with a small numerical example.

Significance. If the central reduction holds, the result is significant: it supplies an exact structural condition under which the generally intractable IB problem reduces to a problem whose complexity is governed by the dimension of the sufficient statistic rather than the ambient dimension of T. This provides a formal bridge between discrete and linear-Gaussian regimes and explains why the Gaussian IB admits a closed-form solution. The loss-free preservation of the entire curve and all optima strengthens the practical utility of the reduction for high-dimensional sources.

minor comments (2)
  1. Abstract: the claim that the reduction 'identifies an exact structural condition under which the generic IB problem becomes tractable' should be qualified by noting that tractability remains governed by the cardinality or dimension of φ(T) itself.
  2. The nonlinear-Gaussian generalisation is announced but its precise statement and derivation are not visible in the provided abstract; a short proof sketch or explicit statement of the resulting objective would aid verification.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our manuscript, the assessment of its significance, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core reduction follows directly from the standard definition of a sufficient statistic (p(C|T) = p(C|φ(T))) and the chain-rule decomposition of mutual information I(T;Z) = I(φ(T);Z) + I(T;Z|φ(T)) ≥ I(φ(T);Z), both of which are external mathematical facts independent of the paper. The equivalence of the full IB curve, Lagrangian optima, and optimal representations is a straightforward consequence of the induced Markov chain C ⊥ T | φ(T) and does not involve any fitted parameters, self-referential predictions, or load-bearing self-citations. The Gaussian IB result is recovered as a corollary from prior independent work (Chechik et al.), and the nonlinear-Gaussian extension is stated without circularity. No step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the standard definition of sufficient statistics and the Lagrangian formulation of IB, both taken from prior information-theory literature; no free parameters, new entities, or ad-hoc assumptions are introduced in the abstract.

axioms (2)
  • standard math Definition of sufficient statistic: φ(T) is sufficient for C if p(C|T) = p(C|φ(T))
    Invoked directly in the statement that p(C|T) factors through φ(T)
  • domain assumption IB objective is the Lagrangian I(T;Z) - η I(Z;C) minimized over representations Z
    The equivalence is claimed to hold for the full curve and every η

pith-pipeline@v0.9.0 · 5503 in / 1515 out tokens · 106634 ms · 2026-05-07T12:39:02.711234+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

  1. [1]

    Emergence of invariance and disentanglement in deep representations.Journal of Machine Learning Research, 19(50):1–34, 2018

    Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations.Journal of Machine Learning Research, 19(50):1–34, 2018

  2. [2]

    Alemi, Ian Fischer, Joshua V

    Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. InInternational Conference on Learning Representations, Toulon, France, 2017. Poster

  3. [3]

    Alemi, Ben Poole, Ian Fischer, Joshua V

    Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 159–168, 2018

  4. [4]

    Richard E. Blahut. Computation of channel capacity and rate-distortion functions.IEEE Transactions on Information Theory, 18(4):460–473, 1972

  5. [5]

    Information bottleneck for gaussian variables.Journal of Machine Learning Research, 6(6):165–188, 2005

    Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for gaussian variables.Journal of Machine Learning Research, 6(6):165–188, 2005. 8

  6. [6]

    Cover and Joy A

    Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. Wiley-Interscience, Hoboken, NJ, 2 edition, 2006

  7. [7]

    Springer, New York, 2 edition, 2002

    Olav Kallenberg.Foundations of Modern Probability. Springer, New York, 2 edition, 2002

  8. [8]

    Gaussian lower bound for the information bottleneck limit.Journal of Machine Learning Research, 18(213):1–29, 2018

    Amichai Painsky and Naftali Tishby. Gaussian lower bound for the information bottleneck limit.Journal of Machine Learning Research, 18(213):1–29, 2018

  9. [9]

    Information- based clustering.Proceedings of the National Academy of Sciences of the United States of America, 102(51):18297–18302, 2005

    Noam Slonim, Gurinder Singh Atwal, Gaˇ sper Tkaˇ cik, and William Bialek. Information- based clustering.Proceedings of the National Academy of Sciences of the United States of America, 102(51):18297–18302, 2005

  10. [10]

    Naftali Tishby, Fernando C. N. Pereira, and William Bialek. The information bottleneck method. InProceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, Monticello, IL, 1999. A Notation summary Symbol Meaning T,C,Xsource, relevance variable, representation φsufficient statistic mapφ:T → Z IT,C(R) IB info...