Deep Variational Information Bottleneck
Pith reviewed 2026-05-18 06:18 UTC · model grok-4.3
The pith
A variational approximation to the information bottleneck lets neural networks learn compressed yet predictive representations that generalize better and resist adversarial attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method Deep Variational Information Bottleneck, or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.
What carries the argument
The Deep VIB objective: a variational bound on the information-bottleneck Lagrangian that replaces the intractable mutual-information terms with expectations under parameterized encoder and decoder distributions. Written as a loss to be minimized, it upper-bounds the negated Lagrangian up to a constant.
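In the notation used by the referee report on this page (encoder q(z|x), decoder p(y|z), variational marginal r(z)), the bound can be sketched as follows. This is a restatement under those notational assumptions rather than a quotation of the paper's equations; H(Y) denotes the label entropy, which does not depend on the model parameters.

```latex
\begin{aligned}
I(Z;Y) &\ge \mathbb{E}_{p(x,y)}\,\mathbb{E}_{q(z\mid x)}\big[\log p(y\mid z)\big] + H(Y),\\
I(X;Z) &\le \mathbb{E}_{p(x)}\,\mathrm{KL}\big(q(z\mid x)\,\|\,r(z)\big),\\
\mathcal{L}_{\mathrm{VIB}}
  &= \mathbb{E}_{p(x,y)}\Big[\mathbb{E}_{q(z\mid x)}\big[-\log p(y\mid z)\big]
     + \beta\,\mathrm{KL}\big(q(z\mid x)\,\|\,r(z)\big)\Big]\\
  &\ge -\big(I(Z;Y) - \beta\, I(X;Z)\big) + H(Y).
\end{aligned}
```

Minimizing the loss therefore pushes up a lower bound on the Lagrangian I(Z;Y) - β I(X;Z); how close the two stay depends on how well p(y|z) and r(z) match the true conditional and marginal.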
If this is right
- Networks trained with the VIB objective achieve higher accuracy on held-out test data than networks trained with dropout or weight decay.
- The learned representations exhibit greater robustness to adversarial perturbations crafted to maximize prediction error.
- A single scalar beta directly controls the amount of compression applied to the input representation.
- The method supports fully end-to-end training of deep architectures without requiring separate pre-training stages.
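How the scalar beta enters training can be sketched in a few lines of plain Python. The sketch assumes a diagonal-Gaussian encoder q(z|x) and a standard-normal r(z), a common choice that this page does not itself confirm; the function names and the toy decoder are hypothetical and exist only for illustration.

```python
import math
import random


def kl_diag_gaussian_to_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_z(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - log_norm for v in logits]


def vib_loss(mu, log_var, decoder, label, beta, rng, n_samples=1):
    """Monte Carlo estimate of E_q[-log p(y|z)] + beta * KL(q(z|x) || r(z))."""
    nll = 0.0
    for _ in range(n_samples):
        z = sample_z(mu, log_var, rng)
        nll -= log_softmax(decoder(z))[label]
    nll /= n_samples
    return nll + beta * kl_diag_gaussian_to_std_normal(mu, log_var)


def toy_decoder(z):
    """Hypothetical two-class decoder: logits depend only on the first latent."""
    return [z[0], -z[0]]


# Encoder outputs for one example (assumed values), then the per-example loss.
loss = vib_loss([1.0, 0.0], [0.0, 0.0], toy_decoder, label=0, beta=1e-3,
                rng=random.Random(0))
```

With beta set to 0 the loss reduces to cross-entropy on a stochastic representation; larger values weight the KL term more heavily and pull q(z|x) toward r(z), which is the compression knob the list above refers to.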
Where Pith is reading between the lines
- The same variational bound could be adapted to sequential or graph-structured data where explicit compression of history or neighborhood information is desirable.
- Success of VIB on adversarial robustness suggests that many existing regularizers may be implicitly performing a similar information-compression role.
- Combining the objective with modern data-augmentation pipelines might further widen the robustness gap observed in the paper.
Load-bearing premise
The variational bounds on the mutual information terms stay tight enough during training that the learned representation actually realizes the intended compression-prediction trade-off.
What would settle it
Measure the true mutual informations I(X;Z) and I(Z;Y) after training and check whether they vary with the beta parameter exactly as the information-bottleneck curve predicts, or run head-to-head comparisons on multiple datasets where VIB fails to beat standard regularizers.
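On a small discrete toy problem the true I(X;Z) can be computed exactly, which makes the bound gap directly visible. The sketch below is an illustration with made-up distributions, not the paper's experiment; it relies on the identity E_x KL(q(z|x) || r(z)) = I(X;Z) + KL(q(z) || r(z)), where q(z) is the marginal induced by the encoder.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def marginal(p_x, enc):
    """Induced marginal q(z) = sum_x p(x) q(z|x)."""
    n_z = len(enc[0])
    return [sum(px * row[z] for px, row in zip(p_x, enc)) for z in range(n_z)]


def mutual_information(p_x, enc):
    """Exact I(X;Z) = E_x KL(q(z|x) || q(z))."""
    q_z = marginal(p_x, enc)
    return sum(px * kl(row, q_z) for px, row in zip(p_x, enc))


def variational_upper_bound(p_x, enc, r_z):
    """The compression term of the VIB loss: E_x KL(q(z|x) || r(z))."""
    return sum(px * kl(row, r_z) for px, row in zip(p_x, enc))


# Hypothetical toy setup: two inputs, two latent states.
p_x = [0.5, 0.5]
encoder = [[0.9, 0.1], [0.2, 0.8]]   # rows are q(z|x) for each x
r_z = [0.5, 0.5]                     # fixed variational marginal

true_mi = mutual_information(p_x, encoder)
bound = variational_upper_bound(p_x, encoder, r_z)
gap = bound - true_mi                # equals KL(q(z) || r(z)), never negative
```

For continuous latents the exact quantity is no longer available, so a comparable check would need a separate mutual-information estimator, with its own bias, in place of `mutual_information`.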
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a variational approximation to the Information Bottleneck (IB) principle of Tishby et al. (1999), called Deep Variational Information Bottleneck (VIB). It parameterizes the IB model with neural networks for the encoder q(z|x) and decoder p(y|z), applies the reparameterization trick, and optimizes a variational surrogate to the IB Lagrangian max I(Y;Z) - β I(X;Z). The central claim is that VIB-trained models outperform those trained with other regularizers on generalization and adversarial robustness.
Significance. If the variational bounds remain sufficiently tight and the method truly implements the intended IB trade-off, this supplies a practical, scalable realization of information-theoretic regularization for deep networks. The approach could influence regularization techniques and robustness research by providing a principled alternative to ad-hoc penalties.
major comments (3)
- [§2, Eq. (3)] The derivation in §2 applies the standard variational lower bound to the IB objective, yielding the loss E_{q(z|x)}[-log p(y|z)] + β KL(q(z|x)||r(z)). However, no diagnostic is provided (e.g., estimated mutual information curves or bound-gap plots) to confirm that the upper bound on I(X;Z) and lower bound on I(Y;Z) stay tight throughout optimization; if loose, the reported gains may arise from the specific KL regularizer rather than IB compression.
- [Table 1, §4.1] Table 1 and §4.1 report superior MNIST generalization for VIB over dropout and weight decay, but the comparison does not control for hyper-parameter search budget across methods. Without this, it is unclear whether the advantage is attributable to the IB principle or to differences in tuning effort.
- [§4.3] §4.3 claims improved adversarial robustness, yet the evaluation uses a fixed attack strength without reporting sensitivity to stronger attacks or providing the exact attack parameters. This leaves open whether the robustness is a genuine consequence of the information bottleneck or an artifact of the chosen evaluation.
minor comments (2)
- [§2] The notation for the variational prior r(z) versus the marginal p(z) should be made consistent across equations to avoid reader confusion.
- [Figure 1] Figure 1 would benefit from axis labels that explicitly state the quantities plotted (e.g., estimated I(X;Z) versus β) and from reporting results over multiple random seeds.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below, indicating where revisions will be made.
-
Referee: [§2, Eq. (3)] The derivation in §2 applies the standard variational lower bound to the IB objective, yielding the loss E_{q(z|x)}[-log p(y|z)] + β KL(q(z|x)||r(z)). However, no diagnostic is provided (e.g., estimated mutual information curves or bound-gap plots) to confirm that the upper bound on I(X;Z) and lower bound on I(Y;Z) stay tight throughout optimization; if loose, the reported gains may arise from the specific KL regularizer rather than IB compression.
Authors: We agree that diagnostics on bound tightness would strengthen the paper. In the revision we will add plots of the estimated mutual information terms I(X;Z) and I(Y;Z) together with the variational gap throughout training. These will be computed using the same Monte-Carlo estimators already present in the code and will help confirm that the reported gains track the intended IB trade-off. revision: yes
-
Referee: [Table 1, §4.1] Table 1 and §4.1 report superior MNIST generalization for VIB over dropout and weight decay, but the comparison does not control for hyper-parameter search budget across methods. Without this, it is unclear whether the advantage is attributable to the IB principle or to differences in tuning effort.
Authors: This is a fair criticism. While we performed grid searches of comparable size for all methods, we did not explicitly equalize total wall-clock budget. In the revised manuscript we will report the exact hyper-parameter ranges explored for each baseline and add a short discussion of search effort. A fully re-tuned matched-budget experiment is beyond the scope of a minor revision but can be noted as future work if the referee requests it. revision: partial
-
Referee: [§4.3] §4.3 claims improved adversarial robustness, yet the evaluation uses a fixed attack strength without reporting sensitivity to stronger attacks or providing the exact attack parameters. This leaves open whether the robustness is a genuine consequence of the information bottleneck or an artifact of the chosen evaluation.
Authors: We accept the point. The original experiments used FGSM with ε = 0.3 (standard at the time) but omitted full parameter disclosure and sensitivity curves. The revision will state the precise attack parameters, include results for a range of ε values, and add a brief comparison with PGD to show that the robustness advantage persists under stronger attacks. revision: yes
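The kind of ε sweep promised in the last response can be sketched on a deliberately tiny model. The example below applies FGSM (Goodfellow et al., 2015) to a hypothetical two-feature logistic classifier; the weights, the input, the ε grid, and the clipping of inputs to [0, 1] are all assumptions made for illustration and are not taken from the paper.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def input_gradient(w, b, x, y):
    """Gradient of binary cross-entropy with respect to the input x."""
    p = sigmoid(logit(w, b, x))
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One-step attack: x + eps * sign(grad_x loss), clipped to [0, 1] (assumed input range)."""
    grad = input_gradient(w, b, x, y)
    return [min(1.0, max(0.0, xi + eps * sign(gi))) for xi, gi in zip(x, grad)]


def predict(w, b, x):
    return int(sigmoid(logit(w, b, x)) >= 0.5)


# Hypothetical model and a correctly classified input with label 1.
w, b = [2.0, -1.0], -0.3
x, y = [0.6, 0.4], 1

# For each attack strength, record whether the prediction flips.
flips = {eps: predict(w, b, fgsm(w, b, x, y, eps)) != y
         for eps in (0.0, 0.05, 0.1, 0.2, 0.3)}
```

Reporting the whole `flips` curve, rather than a single ε, is what separates a robustness claim from an artifact of one attack strength; a multi-step attack such as PGD would iterate the same update with a projection back into the ε-ball.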
Circularity Check
No significant circularity: standard variational approximation to external IB objective
The paper starts from the information bottleneck Lagrangian of Tishby et al. (1999), an external reference, and applies the standard variational upper bound on I(X;Z) via KL(q(z|x)||r(z)) together with a lower bound on I(Y;Z) via the decoder expectation. This produces a tractable objective that is then optimized with neural networks and the reparameterization trick. Neither the derivation nor the empirical performance claims reduce to a fitted parameter, self-definition, or self-citation chain; the bounds are explicit approximations whose tightness is an empirical question rather than a definitional identity. Experiments compare against other regularizers on held-out data, providing independent content.
Axiom & Free-Parameter Ledger
free parameters (1)
- beta
axioms (1)
- domain assumption: A variational marginal r(z), paired with the encoder q(z|x), can be used to obtain a tractable upper bound on the mutual information I(X;Z).
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
I(Z,Y;θ)−βI(Z,X;θ)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
-
JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...
-
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
A new orthogonal projection module for video anomaly detection suppresses facial attributes via weak face-presence signals and cosine alignment while preserving anomaly-relevant features like pose and motion.
-
From Observations to States: Latent Time Series Forecasting
LatentTSF improves time series forecasting accuracy and representation quality by shifting prediction from observation space to a learned latent state space via autoencoding.
-
Information Filtering via Variational Regularization for Robot Manipulation
Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld whil...
-
Dream to Control: Learning Behaviors by Latent Imagination
Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
-
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.
-
Hypergraph and Latent ODE Learning for Multimodal Root Cause Localization in Microservices
HyperODE RCA integrates hypergraph learning with latent ODEs and cross-modal attention to improve root cause localization in microservice architectures on the Tianchi AIOps benchmark.
-
Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data
DySIB recovers a two-dimensional representation matching the phase space of a physical pendulum from high-dimensional video data by maximizing predictive mutual information in latent space.
-
Variational Feature Compression for Model-Specific Representations
A variational latent bottleneck with KL regularization and a dynamic binary mask based on saliency produces model-specific features that keep high accuracy for one classifier but drop others below 2% on CIFAR-100 with...
-
Super Agents and Confounders: Influence of surrounding agents on vehicle trajectory prediction
Surrounding agents frequently degrade trajectory prediction accuracy in interactive driving scenes, and integrating a Conditional Information Bottleneck improves results by ignoring non-beneficial contextual signals.
-
TabTransformer: Tabular Data Modeling Using Contextual Embeddings
TabTransformer uses Transformer self-attention to generate contextual embeddings from categorical features in tabular data, outperforming prior deep learning methods by at least 1% mean AUC and matching tree-based ens...
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
-
URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection
URMF uses learnable Gaussian posteriors to estimate modality-specific uncertainty and adjust fusion weights for improved multimodal sarcasm detection on MSD and MMSD2 benchmarks.
-
DRAFT: Task Decoupled Latent Reasoning for Agent Safety
DRAFT decouples agent safety judgment into latent extraction and reasoning stages, raising average benchmark accuracy from 63.27% to 91.18%.
-
Structural Prognostic Event Modeling for Multimodal Cancer Survival Analysis
SlotSPE is a slot-attention framework that decomposes multimodal cancer data into structural prognostic event slots to improve survival prediction and interpretability.
-
TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination
TALE selectively prunes task-detrimental layers in LLMs at inference time to match or exceed baseline performance with lower computational cost across multiple models and tasks.
-
Adversary-Free Counterfactual Prediction via Information-Regularized Representations
Develops an adversary-free counterfactual prediction framework by deriving a variational objective that upper-bounds mutual information between stochastic representations and treatments.
Reference graph
Works this paper leans on
-
[1]
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016
-
[2]
Information Dropout: Learning Optimal Representations Through Noisy Computation
Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. 2016. URL http://arxiv.org/abs/1611.01353
-
[3]
The IM algorithm: a variational approach to information maximization
David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In NIPS, volume 16, 2004
-
[4]
The virtues of peer pressure: A simple method for discovering high-value mistakes
Shumeet Baluja, Michele Covell, and Rahul Sukthankar. The virtues of peer pressure: A simple method for discovering high-value mistakes. In Intl. Conf. Computer Analysis of Images and Patterns, 2015
-
[5]
Towards open world recognition
Abhijit Bendale and Terrance Boult. Towards open world recognition. In CVPR, 2015
-
[6]
Predictability, complexity, and learning
William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, complexity, and learning. Neural Computation, 13(11):2409--2463, 2001
-
[7]
Weight uncertainty in neural networks
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In ICML, 2015
-
[8]
Ryan P. Browne and Paul D. McNicholas. Multivariate sharp quadratic bounds via Σ-strong convexity and the Fenchel connection. Electronic Journal of Statistics, 9, 2015
-
[9]
Towards evaluating the robustness of neural networks
Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. Arxiv, 2016
-
[10]
Relevant sparse codes with variational information bottleneck
Matthew Chalk, Olivier Marre, and Gasper Tkacik. Relevant sparse codes with variational information bottleneck. In NIPS, 2016
-
[11]
G. Chechik, A. Globerson, N. Tishby, and Y. Weiss. Information bottleneck for Gaussian variables. J. of Machine Learning Research, 6:165--188, 2005
-
[12]
Differential privacy as a mutual information constraint
Paul Cuff and Lanqing Yu. Differential privacy as a mutual information constraint. In ACM Conference on Computer and Communications Security (CCS), 2016
-
[13]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248--255. IEEE, 2009
-
[14]
Robustness of classifiers: from adversarial to random noise
Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In NIPS, 2016
-
[15]
Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AI/Statistics, volume 9, pp. 249--256, 2010
-
[16]
Explaining and harnessing adversarial examples
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015
-
[17]
beta-VAE: Learning basic visual concepts with a constrained variational framework
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017. URL https://openreview.net/pdf?id=Sy2fzU9gl
-
[18]
Learning with a Strong Adversary
Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvári. Learning with a strong adversary. CoRR, abs/1511.03034, 2015
-
[19]
Adam: A method for stochastic optimization
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015
-
[20]
Auto-encoding variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014
-
[21]
Adversarial examples in the physical world
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In ICLR Workshop, 2017. URL https://openreview.net/pdf?id=S1OufnIlx
-
[22]
The Variational Fair Autoencoder
Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. In ICLR, 2016. URL http://arxiv.org/abs/1511.00830
-
[23]
Information theory, inference and learning algorithms
David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003
-
[24]
Variational information maximisation for intrinsically motivated reinforcement learning
Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In NIPS, pp. 2125--2133, 2015
-
[25]
Universal adversarial perturbations
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. Arxiv, 2016
-
[26]
Deepfool: a simple and accurate method to fool deep neural networks
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In CVPR, 2016
-
[27]
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images
Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, 2015. URL http://arxiv.org/abs/1412.1897
-
[28]
Predictive information in a sensory population
Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive information in a sensory population. PNAS, 112(22):6908--6913, 2015
-
[29]
The limitations of deep learning in adversarial settings
Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Proceedings of the 1st IEEE European Symposium on Security and Privacy, 2015
-
[30]
Regularizing neural networks by penalizing confident output predictions
Gabriel Pereyra, George Tucker, Jan Chorowski, and Lukasz Kaiser. Regularizing neural networks by penalizing confident output predictions. In ICLR Workshop, 2017. URL https://openreview.net/pdf?id=HyhbYrGYe
-
[31]
Acceleration of stochastic approximation by averaging
Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838--855, 1992
-
[32]
Confusing Deep Convolution Networks by Relabelling
Leigh Robinson and Benjamin Graham. Confusing deep convolution networks by relabelling. arXiv preprint 1510.06925, 2015
-
[33]
Adversarial manipulation of deep representations
Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J Fleet. Adversarial manipulation of deep representations. In ICLR, 2016
-
[34]
Learning and generalization with the information bottleneck
Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696--2711, 2010
-
[35]
Noam Slonim, Gurinder Singh Atwal, Gašper Tkačik, and William Bialek. Information-based clustering. PNAS, 102(51):18297--18302, 2005
-
[36]
How many clusters? an information-theoretic perspective
Susanne Still and William Bialek. How many clusters? an information-theoretic perspective. Neural Computation, 16(12):2483--2506, 2004
-
[37]
Intriguing properties of neural networks
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014. URL http://arxiv.org/abs/1312.6199
-
[38]
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016
-
[39]
Deep learning and the information bottleneck principle
N Tishby and N Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop, pp. 1--5, April 2015a
-
[40]
N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In The 37th annual Allerton Conf. on Communication, Control, and Computing, pp. 368--377, 1999
-
[41]
Deep learning and the information bottleneck principle
Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pp. 1--5. IEEE, 2015b
-
[42]
On the relation between identifiability, differential privacy and Mutual-Information privacy
Weina Wang, Lei Ying, and Junshan Zhang. On the relation between identifiability, differential privacy and Mutual-Information privacy. IEEE Trans. Inf. Theory, 62:5018--5029, 2016a
-
[43]
Deep Variational Canonical Correlation Analysis
Weiran Wang, Honglak Lee, and Karen Livescu. Deep variational canonical correlation analysis. arXiv [cs.LG], 11 October 2016b. URL https://arxiv.org/abs/1610.03454