Feature selection for high-dimensional integrated data

Charles Zheng; Ivan Ivanov; Raymond Carroll; Robert Chapkin; Scott Schwartz

arxiv: 1111.6283 · v1 · pith:DRVBXWY3new · submitted 2011-11-27 · 📊 stat.AP

Charles Zheng, Scott Schwartz, Robert Chapkin, Raymond Carroll, Ivan Ivanov

keywords: feature, selection, methods, predictors, thresholding, accuracy, approximation, asymptotic

Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of \emph{feature selection} in which only a subset $X_t$ of the predictors depends on the multidimensional variate $Y$, and the remaining predictors constitute a "noise set" $X_u$ independent of $Y$. Using Monte Carlo simulations, we investigate the relative performance of two methods, thresholding and singular-value decomposition (SVD), in combination with stochastic optimization to determine "empirical bounds" on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods on a recent infant intestinal gene expression and metagenomics dataset.
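The setup described in the abstract can be illustrated with a small simulation. The abstract does not state which statistics the two methods use, so the sketch below makes its own assumptions: thresholding ranks each predictor by its largest absolute sample correlation with any column of $Y$, and the SVD variant ranks predictors by their loading on the leading singular direction of the sample cross-correlation matrix. The function names (`simulate`, `select_by_threshold`, `select_by_svd`), the unit loadings, and the noise level are all hypothetical choices made for illustration. The stochastic optimization and the asymptotic approximation mentioned in the abstract are not covered.

```python
import numpy as np

def simulate(n=200, p_signal=5, p_noise=45, q=3, seed=0):
    """Draw Y, a signal block X_t that depends on Y, and a noise block X_u."""
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n, q))
    B = np.ones((q, p_signal))          # assumed loadings, fixed at 1 for simplicity
    X_t = Y @ B + 0.5 * rng.standard_normal((n, p_signal))
    X_u = rng.standard_normal((n, p_noise))   # independent of Y
    return np.hstack([X_t, X_u]), Y     # signal columns come first

def cross_correlation(X, Y):
    """Sample cross-correlation matrix between columns of X and columns of Y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / X.shape[0]

def select_by_threshold(X, Y, k):
    """Keep the k predictors with the largest absolute correlation with any Y column."""
    score = np.abs(cross_correlation(X, Y)).max(axis=1)
    return np.sort(np.argsort(score)[-k:])

def select_by_svd(X, Y, k):
    """Keep the k predictors with the largest loading on the leading singular direction."""
    U, s, _ = np.linalg.svd(cross_correlation(X, Y), full_matrices=False)
    score = np.abs(U[:, 0]) * s[0]
    return np.sort(np.argsort(score)[-k:])

X, Y = simulate()
print("thresholding:", select_by_threshold(X, Y, k=5))
print("SVD:         ", select_by_svd(X, Y, k=5))
```

In this toy setting the signal is strong relative to the noise, so both rules should recover the first five columns. A comparison in the spirit of the abstract would repeat the simulation over many seeds, sample sizes, and signal strengths and record how often each rule recovers $X_t$.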
