Feature Selection via Mutual Information: New Theoretical Insights
Pith reviewed 2026-05-24 20:31 UTC · model grok-4.3
The pith
Conditional mutual information arises when bounding the ideal prediction errors of feature subsets, leading to guaranteed stopping conditions for greedy selection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors prove that conditional mutual information naturally appears in the derivation of bounds on the ideal errors achieved by different feature subsets. Using this, they introduce a novel stopping condition for greedy methods ensuring the selected subset's prediction error remains bounded.
What carries the argument
Conditional mutual information used to bound ideal regression and classification errors for feature subsets.
If this is right
- Greedy algorithms gain a theoretically justified termination criterion based on error bounds.
- The approach works for both regression and classification tasks.
- Feature subsets can be selected while controlling the maximum ideal error.
- Mutual information based selection gains theoretical support beyond heuristics.
Where Pith is reading between the lines
- This bound could be approximated using empirical estimates in practice to make the method usable.
- It may inspire similar information-theoretic bounds in other selection algorithms.
- Validation on benchmark datasets would test if the bounds are tight enough for real-world use.
Load-bearing premise
The error bounds derived from conditional mutual information can be evaluated or approximated sufficiently well to serve as a practical stopping condition without requiring knowledge of the true data distribution.
What would settle it
A simulation or real dataset where the stopping condition is triggered but the actual ideal prediction error of the selected subset exceeds the specified threshold.
Figures
read the original abstract
Mutual information has been successfully adopted in filter feature-selection methods to assess both the relevancy of a subset of features in predicting the target variable and the redundancy with respect to other variables. However, existing algorithms are mostly heuristic and do not offer any guarantee on the proposed solution. In this paper, we provide novel theoretical results showing that conditional mutual information naturally arises when bounding the ideal regression/classification errors achieved by different subsets of features. Leveraging on these insights, we propose a novel stopping condition for backward and forward greedy methods which ensures that the ideal prediction error using the selected feature subset remains bounded by a user-specified threshold. We provide numerical simulations to support our theoretical claims and compare to common heuristic methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives theoretical bounds on the ideal (Bayes) regression and classification error achieved by a feature subset, showing that conditional mutual information arises naturally in these bounds from standard information-theoretic identities. It then introduces a stopping criterion for forward and backward greedy feature-selection algorithms that guarantees the selected subset keeps the ideal error below a user-specified threshold, and supports the claims with numerical simulations comparing the new criterion to common heuristic MI-based methods.
Significance. If the derived bounds are non-vacuous and the stopping condition can be approximated from data without knowledge of the true distribution, the work would supply the first explicit performance guarantees for a widely used class of filter methods, moving MI-based selection from purely heuristic to theoretically grounded. The simulations provide initial empirical support, but the practical utility hinges on bound tightness and computability.
major comments (2)
- [§4, Eq. (12)] §4, Eq. (12) and the stopping condition (13): the bound on ideal error is expressed in terms of conditional MI quantities that are not directly observable; the manuscript does not provide a concrete estimator or concentration result showing that the stopping threshold can be reliably checked from finite samples without additional assumptions on the joint distribution.
- [Theorem 2] Theorem 2 (forward greedy case): the proof that the stopping condition preserves the error bound relies on the submodularity or monotonicity properties of conditional MI; it is unclear whether these properties hold for the specific error functional used or only under the additional modeling assumptions stated in §3.1.
minor comments (3)
- [§2] Notation for conditional MI I(X;Y|Z) is introduced without an explicit reminder of the chain-rule identity used in the derivation; adding one sentence in §2 would improve readability.
- [Figure 2] Figure 2 caption does not state the number of Monte-Carlo repetitions or the precise definition of 'ideal error' plotted; this makes direct comparison to the theoretical bound difficult.
- [§1.2] The related-work section omits the recent information-theoretic analyses of greedy selection by [specific papers on submodular MI bounds]; a brief comparison would strengthen the novelty claim.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and the positive recommendation for minor revision. We address each major comment below.
read point-by-point responses
-
Referee: §4, Eq. (12) and the stopping condition (13): the bound on ideal error is expressed in terms of conditional MI quantities that are not directly observable; the manuscript does not provide a concrete estimator or concentration result showing that the stopping threshold can be reliably checked from finite samples without additional assumptions on the joint distribution.
Authors: The manuscript focuses on deriving theoretical bounds using population quantities of conditional mutual information. We do not provide estimators or concentration results because the goal is to establish the connection between conditional MI and ideal prediction errors, and to propose the stopping condition in terms of these quantities. In practice, one can use standard nonparametric estimators for MI from data, as done in our simulations. Adding finite-sample guarantees would require stronger assumptions and is beyond the current scope, though we agree it is an important practical consideration. revision: no
-
Referee: Theorem 2 (forward greedy case): the proof that the stopping condition preserves the error bound relies on the submodularity or monotonicity properties of conditional MI; it is unclear whether these properties hold for the specific error functional used or only under the additional modeling assumptions stated in §3.1.
Authors: The proof of Theorem 2 applies the known monotonicity and submodularity properties of conditional mutual information, which hold under the general setting. These properties are used to show that the stopping condition ensures the bound on the error functional remains satisfied. The error bounds in §4 are derived directly from information-theoretic identities that hold under the assumptions of §3.1, so the properties carry over. We will revise the manuscript to make this connection more explicit by adding a brief discussion of the relevant properties of conditional MI. revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The abstract frames the core contribution as deriving that conditional mutual information arises naturally when bounding ideal regression/classification errors from standard information-theoretic identities. No equations, self-definitions, or fitted-parameter-as-prediction steps are visible. The proposed stopping condition for greedy selection follows directly from these bounds rather than reducing to an input by construction. No self-citation load-bearing or ansatz smuggling is indicated. The derivation is self-contained against external information-theoretic benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard definitions and chain-rule properties of mutual information and conditional mutual information from information theory.
Forward citations
Cited by 1 Pith paper
-
Predicting Fetal Birthweight from High Dimensional Data using Advanced Machine Learning
Machine learning pipeline with MICE imputation, tree-based feature selection, and ensemble models predicts birth weight, claiming improved performance on constrained clinical datasets.
Reference graph
Works this paper leans on
-
[1]
A survey on feature selection methods,
G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” Computers & Electrical Engineering , vol. 40, no. 1, pp. 16–28, 2014
work page 2014
-
[2]
Wrappers for feature subset selection,
R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial intelligence, vol. 97, no. 1-2, pp. 273–324, 1997
work page 1997
-
[3]
T. N. Lal, O. Chapelle, J. Weston, and A. Elisseeff, “Embedded methods,” in Feature extraction. Springer, 2006, pp. 137–165
work page 2006
-
[4]
J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V . Vap- nik, “Feature selection for svms,” in Advances in neural information processing systems, 2001, pp. 668–674
work page 2001
-
[5]
Feature selection and ranking filters,
W. Duch, T. Winiarski, J. Biesiada, and A. Kachel, “Feature selection and ranking filters,” in International conference on artificial neural networks (ICANN) and International conference on neural information processing (ICONIP), vol. 251. Citeseer, 2003, p. 254
work page 2003
-
[6]
W. Duch, “Filter methods,” in Feature Extraction. Springer, 2006, pp. 89–117
work page 2006
-
[7]
A review of feature selection methods based on mutual information,
J. R. Vergara and P. A. Est ´evez, “A review of feature selection methods based on mutual information,” Neural computing and applications , vol. 24, no. 1, pp. 175–186, 2014
work page 2014
-
[8]
T. M. Cover and J. A. Thomas, Elements of information theory . John Wiley & Sons, 2012
work page 2012
-
[9]
Feature selection and feature extraction for text categoriza- tion,
D. D. Lewis, “Feature selection and feature extraction for text categoriza- tion,” in Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 212–217
work page 1992
-
[10]
Using mutual information for selecting features in supervised neural net learning,
R. Battiti, “Using mutual information for selecting features in supervised neural net learning,” IEEE Transactions on neural networks, vol. 5, no. 4, pp. 537–550, 1994
work page 1994
-
[11]
Data visualization and feature selection: New algorithms for nongaussian data,
H. H. Yang and J. Moody, “Data visualization and feature selection: New algorithms for nongaussian data,” in Advances in Neural Information Processing Systems, 2000, pp. 687–693
work page 2000
-
[12]
Fast binary feature selection with conditional mutual infor- mation,
F. Fleuret, “Fast binary feature selection with conditional mutual infor- mation,” Journal of Machine Learning Research , vol. 5, no. Nov, pp. 1531–1555, 2004
work page 2004
-
[13]
Conditional infomax learning: an integrated framework for feature extraction and fusion,
D. Lin and X. Tang, “Conditional infomax learning: an integrated framework for feature extraction and fusion,” in European Conference on Computer Vision . Springer, 2006, pp. 68–82
work page 2006
-
[14]
Conditional mutual information-based feature selection analyzing for synergy and redun- dancy,
g. Cheng, Z. Qin, C. Feng, Y . Wang, and F. Li, “Conditional mutual information-based feature selection analyzing for synergy and redun- dancy,” Etri Journal, vol. 33, no. 2, pp. 210–218, 2011
work page 2011
-
[15]
G. Brown, A. Pocock, M.-J. Zhao, and M. Luj ´an, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,” Journal of machine learning research , vol. 13, no. Jan, pp. 27–66, 2012
work page 2012
-
[16]
Algorithms for large scale markov blanket discovery
I. Tsamardinos, C. F. Aliferis, A. R. Statnikov, and E. Statnikov, “Algorithms for large scale markov blanket discovery.” in FLAIRS conference, vol. 2, 2003, pp. 376–380
work page 2003
-
[17]
Information and information stability of random vari- ables and processes,
M. S. Pinsker, “Information and information stability of random vari- ables and processes,” 1960
work page 1960
-
[18]
Information-type measures of difference of probability dis- tributions and indirect observation,
I. Csisz ´ar, “Information-type measures of difference of probability dis- tributions and indirect observation,” studia scientiarum Mathematicarum Hungarica, vol. 2, pp. 229–318, 1967
work page 1967
-
[19]
A lower bound for discrimination information in terms of variation (corresp.),
S. Kullback, “A lower bound for discrimination information in terms of variation (corresp.),” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 126–127, 1967
work page 1967
-
[20]
Irrelevant features and the subset selection problem,
G. H. John, R. Kohavi, and K. Pfleger, “Irrelevant features and the subset selection problem,” in Machine Learning Proceedings 1994 . Elsevier, 1994, pp. 121–129
work page 1994
-
[21]
Estimation of entropy and mutual information,
L. Paninski, “Estimation of entropy and mutual information,” Neural computation, vol. 15, no. 6, pp. 1191–1253, 2003
work page 2003
-
[22]
Nearest neighbor estimate of conditional mutual information in feature selection,
A. Tsimpiris, I. Vlachos, and D. Kugiumtzis, “Nearest neighbor estimate of conditional mutual information in feature selection,” Expert Systems with Applications, vol. 39, no. 16, pp. 12 697–12 708, 2012
work page 2012
-
[23]
Estimating mutual information for discrete-continuous mixtures,
W. Gao, S. Kannan, S. Oh, and P. Viswanath, “Estimating mutual information for discrete-continuous mixtures,” in Advances in Neural Information Processing Systems , 2017, pp. 5986–5997
work page 2017
-
[24]
A. Kraskov, H. St ¨ogbauer, and P. Grassberger, “Estimating mutual information,” Phys. Rev. E , vol. 69, p. 066138, Jun 2004. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevE.69.066138
-
[25]
Functional properties of minimum mean-square error and mutual information,
Y . Wu and S. Verd ´u, “Functional properties of minimum mean-square error and mutual information,” IEEE Transactions on Information The- ory, vol. 58, no. 3, pp. 1289–1301, 2012
work page 2012
-
[26]
Feature selection with a linear dependence measure,
S. K. Das, “Feature selection with a linear dependence measure,” IEEE transactions on Computers , vol. 100, no. 9, pp. 1106–1109, 1971
work page 1971
-
[27]
Correlation-based feature selection for machine learning,
M. A. Hall, “Correlation-based feature selection for machine learning,” 1999
work page 1999
-
[28]
Feature selection for high-dimensional data: A fast correlation-based filter solution,
L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in Proceedings of the 20th interna- tional conference on machine learning (ICML-03) , 2003, pp. 856–863
work page 2003
-
[29]
Feature selection for high-dimensional dataa pearson redundancy based filter,
J. Biesiada and W. Duch, “Feature selection for high-dimensional dataa pearson redundancy based filter,” in Computer recognition systems 2 . Springer, 2007, pp. 242–249
work page 2007
-
[30]
Lin- ear correlation-based feature selection for network intrusion detection model,
H. F. Eid, A. E. Hassanien, T.-h. Kim, and S. Banerjee, “Lin- ear correlation-based feature selection for network intrusion detection model,” in Advances in Security of Information and Communication Networks. Springer, 2013, pp. 240–248
work page 2013
-
[31]
Kernel feature selection via conditional covariance minimization,
J. Chen, M. Stern, M. J. Wainwright, and M. I. Jordan, “Kernel feature selection via conditional covariance minimization,” in Advances in Neural Information Processing Systems , 2017, pp. 6946–6955
work page 2017
-
[32]
UCI machine learning repository,
D. Dheeru and E. Karra Taniskidou, “UCI machine learning repository,”
-
[33]
Available: http://archive.ics.uci.edu/ml
[Online]. Available: http://archive.ics.uci.edu/ml
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.