Learning to Find Correlated Features by Maximizing Information Flow in Convolutional Neural Networks

Fei Li; Rujie Liu; Wei Shen

arxiv: 1907.00348 · v1 · pith:V6L6EPT3new · submitted 2019-06-30 · 💻 cs.CV

Wei Shen, Fei Li, Rujie Liu

Pith reviewed 2026-05-25 13:10 UTC · model grok-4.3

classification 💻 cs.CV

keywords convolutional neural networks · information flow maximization · correlated features · image classification · regularization loss · discriminative information · shiftedMNIST

The pith

Minimizing classification loss causes CNNs to ignore correlated discriminative features; an information flow maximization loss retains more of them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Convolutional neural networks trained with standard classification loss often discard some discriminative information even when it is correlated with the target label. This happens because the loss only drives the model toward the single most discriminative subset of features rather than the full set of useful ones. When test samples depend on the ignored correlated features, accuracy suffers. The paper introduces an information flow maximization loss added as a regularization term to push the network to preserve additional correlated features. Validation on shiftedMNIST shows the resulting models rely on more informative features overall.

Core claim

Minimizing the classification loss does not guarantee that the network learns all of the discriminative information, only the most discriminative part, so correlated discriminative information is discarded. The proposed information flow maximization (IFM) loss, used as a regularization term, addresses this by finding the discriminative correlated features, so that with less information loss the classifier can base its predictions on more informative features.

What carries the argument

The information flow maximization (IFM) loss, introduced as a regularization term that encourages convolutional networks to retain correlated discriminative features beyond those selected by the classification objective alone.

If this is right

The network learns a larger set of representative and discriminative features instead of only the strongest subset.
Predictions become possible from a wider pool of informative features when test conditions emphasize different members of the correlated set.
Information loss during training is reduced while the primary classification objective remains intact.
The regularization term can be added to existing CNN training pipelines for image classification tasks.
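
The last point can be sketched as a combined objective. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: this page does not say how the IFM term is computed, so `ifm_term` is a hypothetical placeholder value supplied by the caller, and `lam` is an assumed weighting hyperparameter.

```python
import math

def softmax_cross_entropy(logits, label):
    """Standard classification loss for one sample (natural log)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def total_loss(logits, label, ifm_term, lam=0.1):
    """Classification loss plus a weighted regularizer.

    ifm_term: hypothetical placeholder for the IFM loss value; its actual
              form is defined in the paper, not here.
    lam:      assumed weight; lam = 0 recovers plain cross-entropy training.
    """
    return softmax_cross_entropy(logits, label) + lam * ifm_term

logits = [2.0, 0.5, -1.0]
ce = softmax_cross_entropy(logits, 0)
print(round(ce, 4))
print(round(total_loss(logits, 0, ifm_term=1.5, lam=0.1), 4))
```

Keeping the regularizer as a separate additive term means an existing training loop would only need to change the scalar it backpropagates, which is the sense in which the term could be added to existing pipelines.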

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularization idea could be tested on architectures other than CNNs where feature correlations within classes are known to exist.
If the IFM term proves stable, it might reduce reliance on explicit data augmentation designed to surface secondary features.
The method highlights a general tension between loss minimization and completeness of learned representations that appears in other supervised settings.

Load-bearing premise

That the IFM loss will successfully encourage retention of correlated features in practice without degrading overall classification accuracy or requiring dataset-specific tuning beyond the shiftedMNIST validation.

What would settle it

On shiftedMNIST or a similar dataset constructed so that test accuracy depends on the secondary correlated features, models trained with the IFM term show no improvement over standard cross-entropy training.
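
The shape of this test can be illustrated with a toy simulation. This is a hedged sketch, not shiftedMNIST: each sample is a hypothetical pair of cues (primary, secondary) that both equal the label, and the primary cue is blanked at test time, similar in spirit to the evaluation described in the authors' rebuttal. The two rule-based functions are stand-ins for a model that kept only the strongest cue and a model that retained both; no network is trained here.

```python
import random

NUM_CLASSES = 10
BLANK = None  # marks a cue that has been removed at test time

def make_samples(n, rng, remove_primary=False):
    """Hypothetical toy data: both cues equal the label unless removed."""
    samples = []
    for _ in range(n):
        label = rng.randrange(NUM_CLASSES)
        primary = BLANK if remove_primary else label
        samples.append(((primary, label), label))
    return samples

def primary_only(cues, rng):
    """Stand-in for a model that learned only the most discriminative cue."""
    primary, _ = cues
    # With the primary cue gone, this model can only guess.
    return primary if primary is not BLANK else rng.randrange(NUM_CLASSES)

def both_cues(cues, rng):
    """Stand-in for a model that also retained the correlated cue."""
    primary, secondary = cues
    return primary if primary is not BLANK else secondary

def accuracy(model, samples, rng):
    correct = sum(model(cues, rng) == label for cues, label in samples)
    return correct / len(samples)

rng = random.Random(0)
train_like = make_samples(2000, rng)
shifted_test = make_samples(2000, rng, remove_primary=True)

for name, model in [("primary_only", primary_only), ("both_cues", both_cues)]:
    print(name, accuracy(model, train_like, rng), accuracy(model, shifted_test, rng))
```

If a model trained with the IFM term behaved like `primary_only` on a test split of this kind, the claim would not be supported.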

Figures

Figures reproduced from arXiv: 1907.00348 by Fei Li, Rujie Liu, Wei Shen.

**Figure 3.** Some training examples (a) and test samples (b) from the [PITH_FULL_IMAGE:figures/full_fig_p003_3.png]

Original abstract

Training convolutional neural networks for image classification tasks usually causes information loss. Although most of the time the information lost is redundant with respect to the target task, there are still cases where discriminative information is also discarded. For example, if the samples that belong to the same category have multiple correlated features, the model may only learn a subset of the features and ignore the rest. This may not be a problem unless the classification in the test set highly depends on the ignored features. We argue that the discard of the correlated discriminative information is partially caused by the fact that the minimization of the classification loss doesn't ensure to learn the overall discriminative information but only the most discriminative information. To address this problem, we propose an information flow maximization (IFM) loss as a regularization term to find the discriminative correlated features. With less information loss the classifier can make predictions based on more informative features. We validate our method on the shiftedMNIST dataset and show the effectiveness of IFM loss in learning representative and discriminative features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The IFM loss targets a plausible gap in CNN feature learning, but experiments on shiftedMNIST alone leave the core claim untested beyond that one constructed dataset.

The one thing to know is that the IFM regularization targets a real issue in CNN training but hasn't been shown to solve it outside a very specific setup. The paper argues that standard cross-entropy minimization picks up only the strongest discriminative signals and drops correlated ones that could help on test data. They introduce the information flow maximization loss as an add-on term to encourage the network to keep more of that information. This is a straightforward extension of existing regularization ideas in deep learning.

What they do well is state the motivation clearly and pick a dataset that demonstrates the problem they describe. ShiftedMNIST forces the model to deal with correlated features from the shifts, so it's a reasonable starting point for validation.

The soft spots are in the evaluation. All results come from this one constructed dataset. There's no evidence from natural image tasks, no ablation studies to confirm the loss works through the proposed mechanism rather than some other effect, and no direct measures like feature visualizations or information estimates to show that correlated features are retained. The claim that it finds representative and discriminative features rests on accuracy numbers alone from this narrow case.

If the method generalizes, it could be useful for people building more robust classifiers when multiple features matter. But right now the paper reads as an initial idea that needs more work to be convincing. I would not recommend it for peer review in its current form; the authors should add experiments on standard benchmarks and mechanistic checks first.

Referee Report

2 major / 1 minor

Summary. The paper argues that standard cross-entropy minimization in CNNs for image classification learns only the most discriminative features and discards other correlated discriminative information present in the data. It proposes an information flow maximization (IFM) loss as a regularization term to encourage retention of these correlated features, thereby reducing information loss and improving predictions when test data depends on the secondary features. The approach is validated solely on the shiftedMNIST dataset.

Significance. If the IFM regularization can be shown to measurably increase retention of secondary correlated features without degrading primary-task accuracy and without extensive per-dataset tuning, the method would address a plausible limitation of standard training and could improve robustness in distribution-shift scenarios that rely on multiple correlated cues. The manuscript supplies no machine-checked proofs, reproducible code, or parameter-free derivations.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the central claim that cross-entropy minimization discards correlated discriminative information (and that IFM corrects this) is load-bearing, yet the manuscript supplies no mutual-information estimates, feature visualizations, or controlled ablations that isolate whether accuracy changes arise from the claimed mechanism versus other side-effects of the auxiliary loss.
[Experiments] Experiments section: all quantitative results are confined to shiftedMNIST (a dataset constructed precisely to embed explicit digit-shift correlations); no results, ablations, or hyper-parameter sensitivity analysis are reported on any natural-image benchmark, leaving the weakest assumption—that the loss behaves as intended outside this narrow setting—untested.
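
One of the diagnostics requested in the first comment, a mutual-information estimate, can be sketched for the simplest case. The function below is an illustrative plug-in estimator for two discrete variables (for example, a quantized scalar feature and the class label). It is not taken from the manuscript, and it does not scale to high-dimensional CNN features, where estimating mutual information is considerably harder.

```python
import math
from collections import Counter

def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi

labels = [0, 1] * 50
copy_feature = list(labels)            # feature fully determined by the label
constant_feature = [0] * len(labels)   # feature carrying no label information

print(mutual_information_bits(copy_feature, labels))
print(mutual_information_bits(constant_feature, labels))
```

A feature that copies a balanced binary label carries one bit about it, and a constant feature carries none; comparing such estimates for a secondary feature with and without the auxiliary loss is the kind of direct evidence the comment asks for.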

minor comments (1)

[Abstract] Abstract: the phrase 'with less information loss the classifier can make predictions based on more informative features' is repeated without a precise definition of 'information flow' or how the IFM term is computed.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We address the major comments as follows.

Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim that cross-entropy minimization discards correlated discriminative information (and that IFM corrects this) is load-bearing, yet the manuscript supplies no mutual-information estimates, feature visualizations, or controlled ablations that isolate whether accuracy changes arise from the claimed mechanism versus other side-effects of the auxiliary loss.

Authors: The shiftedMNIST dataset is deliberately constructed with known, explicit correlations between primary (digit) and secondary (shift) features. Our experiments isolate the mechanism by measuring accuracy when the primary feature is removed at test time; the observed gains with IFM versus cross-entropy alone serve as a controlled ablation of the claimed information-retention effect. Direct mutual-information estimation is computationally prohibitive for high-dimensional CNN features, but the performance differential on this controlled task provides quantitative support for the mechanism. We will add feature visualizations in a revision.

revision: partial
Referee: [Experiments] Experiments section: all quantitative results are confined to shiftedMNIST (a dataset constructed precisely to embed explicit digit-shift correlations); no results, ablations, or hyper-parameter sensitivity analysis are reported on any natural-image benchmark, leaving the weakest assumption—that the loss behaves as intended outside this narrow setting—untested.

Authors: We agree that evaluation on natural-image benchmarks would strengthen claims of generality. The present work deliberately uses the controlled shiftedMNIST setting to validate the core hypothesis with known ground-truth correlations; extending the method to natural images is an important direction for future research and is outside the scope of this manuscript.

revision: no

Circularity Check

0 steps flagged

No significant circularity in derivation of IFM loss

The manuscript introduces the information flow maximization (IFM) loss as a novel regularization term whose functional form is defined independently of the classification objective. No equation reduces the proposed loss to a fitted parameter or to a quantity already present in the cross-entropy term; the loss is not obtained by renaming an existing empirical pattern; and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claim therefore remains an independent modeling choice whose validity is tested on shiftedMNIST rather than being forced by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the IFM loss itself is the proposed addition but lacks implementation details for ledger entries.
