pith. machine review for the scientific record.

arxiv: 2605.03636 · v1 · submitted 2026-05-05 · 💻 cs.LG

Recognition: unknown

Information Plane Analysis of Binary Neural Networks

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 16:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords information plane analysis · mutual information estimation · binary neural networks · compression phase · generalization · finite-sample bias · plug-in estimator

The pith

Binary neural networks show frequent late-stage compression in their information planes, yet this does not reliably improve generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies information plane analysis on binary neural networks, where discrete activations make mutual information finite and easier to estimate from samples. It first maps out when the plug-in estimator produces trustworthy mutual information values by testing how sample size and representation dimension affect bias and saturation. Within the reliable regimes, the authors train hundreds of binary networks and observe that compression phases late in training are common, but these compressed representations do not consistently yield better test accuracy; the relationship instead varies with task, architecture, and regularization.

Core claim

Restricting analysis to regimes where mutual information estimates from binary activations remain reliable, experiments with 375 binary neural networks show that late-stage compression in the information plane occurs frequently, yet compressed representations do not consistently correlate with improved generalization performance; the relationship instead depends strongly on task, architecture, and regularization.

What carries the argument

Plug-in mutual information estimator applied to discrete binary activations, with identified regimes of sample size N and dimension D that keep estimates below saturation at log2 N.
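As a concrete illustration (a hypothetical sketch, not the authors' code), the plug-in estimator on binary activations and its saturation behaviour can be reproduced in a few lines: with N samples, the empirical distribution is supported on at most N distinct patterns, so the estimate can never exceed log2 N.

```python
import numpy as np

def plugin_entropy(samples: np.ndarray) -> float:
    """Plug-in (maximum-likelihood) entropy estimate in bits.

    samples: (N, D) array of binary activation vectors; each row is one
    draw from the 2^D-sized pattern alphabet.
    """
    # Count occurrences of each distinct activation pattern.
    _, counts = np.unique(samples, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Saturation demo: when N << 2^D, almost every observed pattern is
# unique, so the estimate clusters near log2(N) regardless of the
# true entropy (here 32 bits for fair coins).
rng = np.random.default_rng(0)
N, D = 1000, 32
samples = rng.integers(0, 2, size=(N, D))
print(plugin_entropy(samples))  # close to log2(1000) ≈ 9.97, far below 32
```

The fair-coin case mirrors the setup the paper's Figure 1 reports, at a dimensionality well past the point where the estimate diverges from the true entropy.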

If this is right

  • Late-stage compression phases appear often during training of binary neural networks.
  • Compressed latent representations in the information plane do not produce better generalization in every setting.
  • The compression-generalization link changes with the chosen task, network architecture, and regularization method.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If better estimators become available, similar information-plane checks could be applied to ordinary real-valued networks.
  • The strong dependence on regularization hints that compression may be an artifact of particular training choices rather than a universal training principle.
  • Model selection routines that reward compression should be tested separately for each task and architecture before adoption.

Load-bearing premise

The plug-in estimator's finite-sample bias depends only on sample size and representation dimension without further assumptions on the data distribution or training dynamics.

What would settle it

Observing mutual information values that stay well below log2 N for sample sizes and dimensions outside the claimed reliable regimes, or finding a single task-architecture-regularization combination where late compression consistently predicts higher test accuracy.

Figures

Figures reproduced from arXiv: 2605.03636 by Bernhard C. Geiger, Maximilian Nothnagel.

Figure 1. Evaluation of the plug-in estimate Ĥ with N = 1,000 samples of D-dimensional Bernoulli vectors of varying success probability p, averaged over 20 experiments. The dotted line represents the true entropy H. While the true entropy increases linearly with dimensionality D, the estimate Ĥ starts diverging at D ≈ 8 and eventually saturates at log2 N. Since in the BNNs in our study …
Figure 2. Compression factor ϱ computed per experiment run and layer, grouped by dataset and experiment group. Each dot is coloured according to the weight decay coefficient λ used in its experiment. Layer indices are given as negative offsets from the output layer, i.e. layer -1 is the last hidden layer.
Figure 3. Select IPs of trained models. All figures show the last (and second-to-last in (b)) hidden FC layer, all with …
Figure 4. Comparison of the MI of the penultimate layer w.r.t. the input data …
read the original abstract

Information plane (IP) analysis has been suggested to study the training dynamics of deep neural networks through mutual information (MI) between inputs, representations, and targets. However, its statistical validity is often compromised by the difficulty of estimating MI from samples of high-dimensional, deterministic representations. In this work, we perform IP analyses on binary neural networks (BNNs) where activations are discrete and MI is finite. We characterise the finite-sample behaviour of the plug-in entropy estimator and identify regimes for sample size $N$ and representation dimensionality $D$ under which MI estimates are reliable. Outside these regimes, we show that empirical MI estimates saturate to $\log_2 N$, rendering IP trajectories uninformative. Restricting attention to the reliable regime, we train 375 BNNs to investigate the existence of late-stage compression phases and the relationship between compressed representations and generalisation performance. Our results show that while late-stage compression is frequently observed, compressed latent representations do not consistently correlate with improved generalization performance. Instead, the relationship between compression and generalisation is highly dependent on task, architecture, and regularisation.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper performs information plane analysis on binary neural networks (BNNs) with discrete activations, where mutual information is finite. It characterizes the finite-sample behavior of the plug-in entropy estimator to identify regimes of sample size N and representation dimensionality D under which MI estimates are reliable; outside these, estimates saturate to log2 N. Restricting to reliable regimes, the authors train 375 BNNs and report that late-stage compression is frequently observed, but compressed latent representations do not consistently correlate with improved generalization performance; the relationship depends on task, architecture, and regularization.

Significance. If the regime characterization is robust, the work provides a useful cautionary result for IP analyses in deep learning by showing that apparent compression can be an estimator artifact. The empirical finding that compression is common yet uncorrelated with generalization (in a controlled BNN setting) challenges assumptions in the literature about the role of compression phases.

major comments (1)
  1. [Section characterizing finite-sample behaviour of the plug-in estimator] The identification of reliable MI regimes is performed as a function of N and D alone (see the section characterizing finite-sample behaviour of the plug-in estimator). For binary activations the plug-in estimator bias depends on the specific probability mass function over the 2^D patterns; these masses are determined by the input data distribution and the weights reached during training. The manuscript does not demonstrate that regime boundaries are insensitive to these factors, so trajectories currently treated as reliable may still be biased toward log2 N saturation. This directly affects the central claim that late-stage compression is frequently observed yet uncorrelated with generalization.
minor comments (1)
  1. [Abstract] The abstract states that 375 BNNs were trained but provides no summary table or description of the distribution over architectures, tasks, and regularization strengths; this makes it difficult to assess how representative the 'highly dependent' conclusion is.
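The major comment's concern about distribution dependence is easy to probe numerically. In this hypothetical check (ours, not an experiment from the paper), the same (N, D) pair yields very different plug-in bias depending on the Bernoulli success probability, because skewed activations concentrate mass on far fewer of the 2^D patterns:

```python
import numpy as np

def plugin_entropy_bits(samples):
    # Plug-in (empirical) entropy of the rows, in bits.
    _, counts = np.unique(samples, axis=0, return_counts=True)
    q = counts / counts.sum()
    return float(-(q * np.log2(q)).sum())

rng = np.random.default_rng(0)
N, D = 1000, 16
for p in (0.5, 0.05):
    # True entropy of D i.i.d. Bernoulli(p) coordinates: D * h(p).
    true = D * (-p * np.log2(p) - (1 - p) * np.log2(1 - p))
    est = np.mean([plugin_entropy_bits(rng.random((N, D)) < p)
                   for _ in range(20)])
    print(f"p={p}: true {true:.2f} bits, plug-in {est:.2f}, bias {est - true:+.2f}")
```

At p = 0.5 the estimate saturates near log2 N ≈ 10 against a true entropy of 16 bits; at p = 0.05 the effective alphabet is much smaller and the bias shrinks, despite identical N and D. This is the sense in which (N, D) alone cannot fully determine reliability.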

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting an important statistical consideration in our analysis of the plug-in estimator. We address the major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Section characterizing finite-sample behaviour of the plug-in estimator] The identification of reliable MI regimes is performed as a function of N and D alone (see the section characterizing finite-sample behaviour of the plug-in estimator). For binary activations the plug-in estimator bias depends on the specific probability mass function over the 2^D patterns; these masses are determined by the input data distribution and the weights reached during training. The manuscript does not demonstrate that regime boundaries are insensitive to these factors, so trajectories currently treated as reliable may still be biased toward log2 N saturation. This directly affects the central claim that late-stage compression is frequently observed yet uncorrelated with generalization.

    Authors: We agree that the finite-sample bias of the plug-in entropy estimator for binary representations depends on the specific PMF over the 2^D patterns and is therefore influenced by the input distribution and the weights learned during training. Our characterization combined theoretical bounds (derived under the worst-case uniform distribution to obtain conservative regime limits) with empirical simulations on synthetic binary patterns. While these steps provide a practical guide for reliable estimation, we did not explicitly verify that the identified (N, D) boundaries remain stable under the exact empirical PMFs arising in trained BNNs. To resolve this, we will add an appendix containing additional experiments that (i) extract the observed activation histograms from the 375 trained networks and (ii) re-evaluate estimator behavior under those distributions. The revised manuscript will report that the regime boundaries shift only marginally, thereby confirming that the trajectories we classified as reliable are not subject to log2 N saturation. This addition will directly bolster the central empirical claim that late-stage compression is common yet does not consistently correlate with generalization performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical IP analysis of BNNs

full rationale

The paper's derivation chain consists of an empirical characterization of the plug-in estimator's finite-sample behavior (saturation to log2 N when N << 2^D follows directly from counting distinct samples in the estimator definition) followed by direct observations on 375 trained BNNs within the identified reliable regimes. No load-bearing step reduces to a self-definition, fitted input renamed as prediction, or self-citation chain; the central claims on late-stage compression frequency and its task-dependent relation to generalization are standalone experimental results without imported uniqueness theorems or ansatzes. The work is self-contained as an empirical investigation.
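The counting argument invoked in the rationale can be written out explicitly. Letting n_k denote the number of times pattern k appears among the N samples, the plug-in estimate satisfies

```latex
\hat{H} = -\sum_{k \,:\, n_k > 0} \frac{n_k}{N} \log_2 \frac{n_k}{N}
\;\le\; \log_2 \left|\{k : n_k > 0\}\right| \;\le\; \log_2 N,
```

since the entropy of a distribution supported on m points is at most log2 m, and at most N distinct patterns can occur in N samples. When N << 2^D and mass is spread widely, nearly every observed pattern is unique, both inequalities are nearly tight, and the IP trajectory pins to log2 N.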

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the standard definition of mutual information and the assumption that the plug-in estimator's bias is governed only by sample size N and representation dimension D. No new entities are introduced.

axioms (2)
  • standard math Mutual information between discrete random variables is finite and can be estimated via the plug-in entropy estimator.
    Invoked throughout the abstract as the basis for IP analysis.
  • domain assumption The finite-sample behaviour of the plug-in estimator is fully characterised by regimes of N and D.
    Stated as the condition under which MI estimates are reliable.

pith-pipeline@v0.9.0 · 5488 in / 1408 out tokens · 37998 ms · 2026-05-07T16:51:38.003337+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1]

    Opening the Black Box of Deep Neural Networks via Information

    R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv:1703.00810 [cs.LG], Mar. 2017

  2. [2]

    On Information Plane Analyses of Neural Network Classifiers—A Review,

    B. C. Geiger, “On Information Plane Analyses of Neural Network Classifiers—A Review,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 12, pp. 7039–7051, Dec. 2022

  3. [3]

    Information bottleneck: Exact analysis of (quantized) neural networks,

    S. S. Lorenzen, C. Igel, and M. Nielsen, “Information bottleneck: Exact analysis of (quantized) neural networks,” arXiv:2106.12912 [cs.LG], Jun. 2021

  4. [4]

    On the information bottleneck theory of deep learning,

    A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, “On the information bottleneck theory of deep learning,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2019, no. 12, p. 124020, Dec. 2019

  5. [5]

    Learning representations for neural network-based classification using the information bottleneck principle,

    R. A. Amjad and B. C. Geiger, “Learning representations for neural network-based classification using the information bottleneck principle,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 9, pp. 2225–2239, Sep. 2020, open-access: arXiv:1802.09766 [cs.LG]

  6. [6]

    Estimating information flow in deep neural networks,

    Z. Goldfeld, E. Van Den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy, “Estimating information flow in deep neural networks,” in Proc. Int. Conf. on Machine Learning (ICML), pp. 2299–2308, Jun. 2019

  7. [7]

    BinaryConnect: Training Deep Neural Networks with binary weights during propagations,

    M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with binary weights during propagations,” in Proc. Advances in Neural Information Processing Systems, 2015

  8. [8]

    Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations,

    I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations,” Journal of Machine Learning Research, vol. 18, no. 187, pp. 1–30, 2018

  9. [9]

    Convergence properties of functional estimates for discrete distributions,

    A. Antos and I. Kontoyiannis, “Convergence properties of functional estimates for discrete distributions,” Random Structures & Algorithms, vol. 19, no. 3-4, pp. 163–193, Oct. 2001

  10. [10]

    Markov Information Bottleneck to Improve Information Flow in Stochastic Neural Networks,

    T. Nguyen-Tang and J. Choi, “Markov Information Bottleneck to Improve Information Flow in Stochastic Neural Networks,” Entropy, vol. 21, no. 10, p. 976, Oct. 2019

  11. [11]

    Understanding learning dynamics of binary neural networks via information bottleneck,

    V. Raj, N. Nayak, and S. Kalyani, “Understanding learning dynamics of binary neural networks via information bottleneck,” arXiv:2006.07522 [cs.LG], Jun. 2020

  12. [12]

    Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference,

    B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 2704–2713

  13. [13]

    Neural Networks for Machine Learning,

    G. Hinton, “Neural Networks for Machine Learning,” University of Toronto, 2012

  14. [14]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Y. Bengio, N. Léonard, and A. Courville, “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,” Aug. 2013, arXiv:1308.3432 [cs]

  15. [15]

    Minimax rates of entropy estimation on large alphabets via best polynomial approximation,

    Y. Wu and P. Yang, “Minimax rates of entropy estimation on large alphabets via best polynomial approximation,” IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3702–3720, 2016

  16. [16]

    Estimation of Entropy and Mutual Information,

    L. Paninski, “Estimation of Entropy and Mutual Information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, Jun. 2003