pith. sign in

arxiv: 1907.07384 · v1 · pith:JOPGGPL5new · submitted 2019-07-17 · 💻 cs.LG · stat.ML

Feature Selection via Mutual Information: New Theoretical Insights

Pith reviewed 2026-05-24 20:31 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords feature selectionmutual informationconditional mutual informationgreedy algorithmserror boundsregressionclassification
0
0 comments X

The pith

Conditional mutual information arises when bounding the ideal prediction errors of feature subsets, leading to guaranteed stopping conditions for greedy selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that conditional mutual information between the target variable and unselected features, given the selected ones, provides bounds on the lowest achievable regression or classification error. This theoretical link allows the design of a stopping condition for forward and backward greedy feature selection algorithms that keeps the ideal error below a user-chosen threshold. Current heuristic methods lack such guarantees on solution quality. A reader might care because it offers a way to perform feature selection with explicit control over prediction performance in high-dimensional settings.

Core claim

The authors prove that conditional mutual information naturally appears in the derivation of bounds on the ideal errors achieved by different feature subsets. Using this, they introduce a novel stopping condition for greedy methods ensuring the selected subset's prediction error remains bounded.

What carries the argument

Conditional mutual information used to bound ideal regression and classification errors for feature subsets.

If this is right

  • Greedy algorithms gain a theoretically justified termination criterion based on error bounds.
  • The approach works for both regression and classification tasks.
  • Feature subsets can be selected while controlling the maximum ideal error.
  • Mutual information based selection gains theoretical support beyond heuristics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This bound could be approximated using empirical estimates in practice to make the method usable.
  • It may inspire similar information-theoretic bounds in other selection algorithms.
  • Validation on benchmark datasets would test if the bounds are tight enough for real-world use.

Load-bearing premise

The error bounds derived from conditional mutual information can be evaluated or approximated sufficiently well to serve as a practical stopping condition without requiring knowledge of the true data distribution.

What would settle it

A simulation or real dataset where the stopping condition is triggered but the actual ideal prediction error of the selected subset exceeds the specified threshold.

Figures

Figures reproduced from arXiv: 1907.07384 by Alberto Maria Metelli, Andrea Tirinzoni, Marcello Restelli, Mario Beraha, Matteo Papini.

Figure 1
Figure 1. Figure 1: SVM test accuracy for different choices of the number of features that generate the problem and different stopping criteria. [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Classification accuracy and fraction of selected features as a function [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
read the original abstract

Mutual information has been successfully adopted in filter feature-selection methods to assess both the relevancy of a subset of features in predicting the target variable and the redundancy with respect to other variables. However, existing algorithms are mostly heuristic and do not offer any guarantee on the proposed solution. In this paper, we provide novel theoretical results showing that conditional mutual information naturally arises when bounding the ideal regression/classification errors achieved by different subsets of features. Leveraging on these insights, we propose a novel stopping condition for backward and forward greedy methods which ensures that the ideal prediction error using the selected feature subset remains bounded by a user-specified threshold. We provide numerical simulations to support our theoretical claims and compare to common heuristic methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper derives theoretical bounds on the ideal (Bayes) regression and classification error achieved by a feature subset, showing that conditional mutual information arises naturally in these bounds from standard information-theoretic identities. It then introduces a stopping criterion for forward and backward greedy feature-selection algorithms that guarantees the selected subset keeps the ideal error below a user-specified threshold, and supports the claims with numerical simulations comparing the new criterion to common heuristic MI-based methods.

Significance. If the derived bounds are non-vacuous and the stopping condition can be approximated from data without knowledge of the true distribution, the work would supply the first explicit performance guarantees for a widely used class of filter methods, moving MI-based selection from purely heuristic to theoretically grounded. The simulations provide initial empirical support, but the practical utility hinges on bound tightness and computability.

major comments (2)
  1. [§4, Eq. (12)] §4, Eq. (12) and the stopping condition (13): the bound on ideal error is expressed in terms of conditional MI quantities that are not directly observable; the manuscript does not provide a concrete estimator or concentration result showing that the stopping threshold can be reliably checked from finite samples without additional assumptions on the joint distribution.
  2. [Theorem 2] Theorem 2 (forward greedy case): the proof that the stopping condition preserves the error bound relies on the submodularity or monotonicity properties of conditional MI; it is unclear whether these properties hold for the specific error functional used or only under the additional modeling assumptions stated in §3.1.
minor comments (3)
  1. [§2] Notation for conditional MI I(X;Y|Z) is introduced without an explicit reminder of the chain-rule identity used in the derivation; adding one sentence in §2 would improve readability.
  2. [Figure 2] Figure 2 caption does not state the number of Monte-Carlo repetitions or the precise definition of 'ideal error' plotted; this makes direct comparison to the theoretical bound difficult.
  3. [§1.2] The related-work section omits the recent information-theoretic analyses of greedy selection by [specific papers on submodular MI bounds]; a brief comparison would strengthen the novelty claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and the positive recommendation for minor revision. We address each major comment below.

read point-by-point responses
  1. Referee: §4, Eq. (12) and the stopping condition (13): the bound on ideal error is expressed in terms of conditional MI quantities that are not directly observable; the manuscript does not provide a concrete estimator or concentration result showing that the stopping threshold can be reliably checked from finite samples without additional assumptions on the joint distribution.

    Authors: The manuscript focuses on deriving theoretical bounds using population quantities of conditional mutual information. We do not provide estimators or concentration results because the goal is to establish the connection between conditional MI and ideal prediction errors, and to propose the stopping condition in terms of these quantities. In practice, one can use standard nonparametric estimators for MI from data, as done in our simulations. Adding finite-sample guarantees would require stronger assumptions and is beyond the current scope, though we agree it is an important practical consideration. revision: no

  2. Referee: Theorem 2 (forward greedy case): the proof that the stopping condition preserves the error bound relies on the submodularity or monotonicity properties of conditional MI; it is unclear whether these properties hold for the specific error functional used or only under the additional modeling assumptions stated in §3.1.

    Authors: The proof of Theorem 2 applies the known monotonicity and submodularity properties of conditional mutual information, which hold under the general setting. These properties are used to show that the stopping condition ensures the bound on the error functional remains satisfied. The error bounds in §4 are derived directly from information-theoretic identities that hold under the assumptions of §3.1, so the properties carry over. We will revise the manuscript to make this connection more explicit by adding a brief discussion of the relevant properties of conditional MI. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract frames the core contribution as deriving that conditional mutual information arises naturally when bounding ideal regression/classification errors from standard information-theoretic identities. No equations, self-definitions, or fitted-parameter-as-prediction steps are visible. The proposed stopping condition for greedy selection follows directly from these bounds rather than reducing to an input by construction. No self-citation load-bearing or ansatz smuggling is indicated. The derivation is self-contained against external information-theoretic benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract only; no explicit free parameters, new axioms, or invented entities are described. The work rests on standard definitions of mutual information and conditional mutual information.

axioms (1)
  • standard math Standard definitions and chain-rule properties of mutual information and conditional mutual information from information theory.
    Invoked to derive the error bounds mentioned in the abstract.

pith-pipeline@v0.9.0 · 5649 in / 1078 out tokens · 31458 ms · 2026-05-24T20:31:31.097191+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Predicting Fetal Birthweight from High Dimensional Data using Advanced Machine Learning

    cs.LG 2025-02 unverdicted novelty 3.0

    Machine learning pipeline with MICE imputation, tree-based feature selection, and ensemble models predicts birth weight, claiming improved performance on constrained clinical datasets.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 1 Pith paper

  1. [1]

    A survey on feature selection methods,

    G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” Computers & Electrical Engineering , vol. 40, no. 1, pp. 16–28, 2014

  2. [2]

    Wrappers for feature subset selection,

    R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial intelligence, vol. 97, no. 1-2, pp. 273–324, 1997

  3. [3]

    Embedded methods,

    T. N. Lal, O. Chapelle, J. Weston, and A. Elisseeff, “Embedded methods,” in Feature extraction. Springer, 2006, pp. 137–165

  4. [4]

    Feature selection for svms,

    J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V . Vap- nik, “Feature selection for svms,” in Advances in neural information processing systems, 2001, pp. 668–674

  5. [5]

    Feature selection and ranking filters,

    W. Duch, T. Winiarski, J. Biesiada, and A. Kachel, “Feature selection and ranking filters,” in International conference on artificial neural networks (ICANN) and International conference on neural information processing (ICONIP), vol. 251. Citeseer, 2003, p. 254

  6. [6]

    Filter methods,

    W. Duch, “Filter methods,” in Feature Extraction. Springer, 2006, pp. 89–117

  7. [7]

    A review of feature selection methods based on mutual information,

    J. R. Vergara and P. A. Est ´evez, “A review of feature selection methods based on mutual information,” Neural computing and applications , vol. 24, no. 1, pp. 175–186, 2014

  8. [8]

    T. M. Cover and J. A. Thomas, Elements of information theory . John Wiley & Sons, 2012

  9. [9]

    Feature selection and feature extraction for text categoriza- tion,

    D. D. Lewis, “Feature selection and feature extraction for text categoriza- tion,” in Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 212–217

  10. [10]

    Using mutual information for selecting features in supervised neural net learning,

    R. Battiti, “Using mutual information for selecting features in supervised neural net learning,” IEEE Transactions on neural networks, vol. 5, no. 4, pp. 537–550, 1994

  11. [11]

    Data visualization and feature selection: New algorithms for nongaussian data,

    H. H. Yang and J. Moody, “Data visualization and feature selection: New algorithms for nongaussian data,” in Advances in Neural Information Processing Systems, 2000, pp. 687–693

  12. [12]

    Fast binary feature selection with conditional mutual infor- mation,

    F. Fleuret, “Fast binary feature selection with conditional mutual infor- mation,” Journal of Machine Learning Research , vol. 5, no. Nov, pp. 1531–1555, 2004

  13. [13]

    Conditional infomax learning: an integrated framework for feature extraction and fusion,

    D. Lin and X. Tang, “Conditional infomax learning: an integrated framework for feature extraction and fusion,” in European Conference on Computer Vision . Springer, 2006, pp. 68–82

  14. [14]

    Conditional mutual information-based feature selection analyzing for synergy and redun- dancy,

    g. Cheng, Z. Qin, C. Feng, Y . Wang, and F. Li, “Conditional mutual information-based feature selection analyzing for synergy and redun- dancy,” Etri Journal, vol. 33, no. 2, pp. 210–218, 2011

  15. [15]

    Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,

    G. Brown, A. Pocock, M.-J. Zhao, and M. Luj ´an, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,” Journal of machine learning research , vol. 13, no. Jan, pp. 27–66, 2012

  16. [16]

    Algorithms for large scale markov blanket discovery

    I. Tsamardinos, C. F. Aliferis, A. R. Statnikov, and E. Statnikov, “Algorithms for large scale markov blanket discovery.” in FLAIRS conference, vol. 2, 2003, pp. 376–380

  17. [17]

    Information and information stability of random vari- ables and processes,

    M. S. Pinsker, “Information and information stability of random vari- ables and processes,” 1960

  18. [18]

    Information-type measures of difference of probability dis- tributions and indirect observation,

    I. Csisz ´ar, “Information-type measures of difference of probability dis- tributions and indirect observation,” studia scientiarum Mathematicarum Hungarica, vol. 2, pp. 229–318, 1967

  19. [19]

    A lower bound for discrimination information in terms of variation (corresp.),

    S. Kullback, “A lower bound for discrimination information in terms of variation (corresp.),” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 126–127, 1967

  20. [20]

    Irrelevant features and the subset selection problem,

    G. H. John, R. Kohavi, and K. Pfleger, “Irrelevant features and the subset selection problem,” in Machine Learning Proceedings 1994 . Elsevier, 1994, pp. 121–129

  21. [21]

    Estimation of entropy and mutual information,

    L. Paninski, “Estimation of entropy and mutual information,” Neural computation, vol. 15, no. 6, pp. 1191–1253, 2003

  22. [22]

    Nearest neighbor estimate of conditional mutual information in feature selection,

    A. Tsimpiris, I. Vlachos, and D. Kugiumtzis, “Nearest neighbor estimate of conditional mutual information in feature selection,” Expert Systems with Applications, vol. 39, no. 16, pp. 12 697–12 708, 2012

  23. [23]

    Estimating mutual information for discrete-continuous mixtures,

    W. Gao, S. Kannan, S. Oh, and P. Viswanath, “Estimating mutual information for discrete-continuous mixtures,” in Advances in Neural Information Processing Systems , 2017, pp. 5986–5997

  24. [24]

    Kraskov, H

    A. Kraskov, H. St ¨ogbauer, and P. Grassberger, “Estimating mutual information,” Phys. Rev. E , vol. 69, p. 066138, Jun 2004. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevE.69.066138

  25. [25]

    Functional properties of minimum mean-square error and mutual information,

    Y . Wu and S. Verd ´u, “Functional properties of minimum mean-square error and mutual information,” IEEE Transactions on Information The- ory, vol. 58, no. 3, pp. 1289–1301, 2012

  26. [26]

    Feature selection with a linear dependence measure,

    S. K. Das, “Feature selection with a linear dependence measure,” IEEE transactions on Computers , vol. 100, no. 9, pp. 1106–1109, 1971

  27. [27]

    Correlation-based feature selection for machine learning,

    M. A. Hall, “Correlation-based feature selection for machine learning,” 1999

  28. [28]

    Feature selection for high-dimensional data: A fast correlation-based filter solution,

    L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in Proceedings of the 20th interna- tional conference on machine learning (ICML-03) , 2003, pp. 856–863

  29. [29]

    Feature selection for high-dimensional dataa pearson redundancy based filter,

    J. Biesiada and W. Duch, “Feature selection for high-dimensional dataa pearson redundancy based filter,” in Computer recognition systems 2 . Springer, 2007, pp. 242–249

  30. [30]

    Lin- ear correlation-based feature selection for network intrusion detection model,

    H. F. Eid, A. E. Hassanien, T.-h. Kim, and S. Banerjee, “Lin- ear correlation-based feature selection for network intrusion detection model,” in Advances in Security of Information and Communication Networks. Springer, 2013, pp. 240–248

  31. [31]

    Kernel feature selection via conditional covariance minimization,

    J. Chen, M. Stern, M. J. Wainwright, and M. I. Jordan, “Kernel feature selection via conditional covariance minimization,” in Advances in Neural Information Processing Systems , 2017, pp. 6946–6955

  32. [32]

    UCI machine learning repository,

    D. Dheeru and E. Karra Taniskidou, “UCI machine learning repository,”

  33. [33]

    Available: http://archive.ics.uci.edu/ml

    [Online]. Available: http://archive.ics.uci.edu/ml