arxiv: 1706.00641 · v1 · pith:ZFACT3G6new · submitted 2017-06-02 · 📊 stat.AP

Improved high-dimensional prediction with Random Forests by the use of co-data

classification 📊 stat.AP
keywords co-data · random · forest · data · demonstrate · gene · moderated · probabilities
Prediction in high-dimensional settings is difficult because the number of variables is large relative to the sample size. We demonstrate how auxiliary "co-data" can be used to improve the performance of a Random Forest in such a setting. Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities (used by default in a Random Forest to draw candidate variables) with co-data moderated sampling probabilities. Co-data are defined here as any type of information that is available on the variables of the primary data but does not use its response labels. Inspired by empirical Bayes, these moderated sampling probabilities are learned from the data at hand. We demonstrate this co-data moderated Random Forest (CoRF) with one example, in which we aim to predict lymph node metastasis from gene expression data. We show how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance.
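
To make the central idea concrete, the following is a minimal sketch of drawing candidate variables with non-uniform probabilities instead of uniform ones. It is not the CoRF procedure: the abstract says the probabilities are learned from the data in an empirical Bayes fashion, whereas here they are derived from a fixed, hypothetical rule (minus log10 of made-up external p-values plus a small floor). The function names `moderated_probabilities` and `draw_candidates`, the `floor` parameter, and all numbers are assumptions for illustration only.

```python
import numpy as np


def moderated_probabilities(scores, floor=1e-3):
    """Turn non-negative co-data scores into sampling probabilities.

    Hypothetical rule for illustration: add a small floor so that every
    variable keeps a non-zero chance of being drawn, then normalize.
    """
    scores = np.asarray(scores, dtype=float)
    if np.any(scores < 0):
        raise ValueError("scores must be non-negative")
    weights = scores + floor
    return weights / weights.sum()


def draw_candidates(probs, mtry, rng):
    """Draw mtry distinct candidate variables for one split.

    A standard Random Forest would use uniform probabilities here;
    passing non-uniform probs is the only change in this sketch.
    """
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


rng = np.random.default_rng(0)

# Made-up external p-values for six variables (assumption, not real data).
p_values = np.array([0.001, 0.5, 0.9, 0.02, 0.3, 0.7])

# Assumed scoring rule: smaller external p-value gives a larger score.
scores = -np.log10(p_values)

probs = moderated_probabilities(scores)
uniform = np.full(len(p_values), 1.0 / len(p_values))

candidates = draw_candidates(probs, mtry=2, rng=rng)
print("uniform:  ", np.round(uniform, 3))
print("moderated:", np.round(probs, 3))
print("candidate variables for this split:", candidates)
```

In this sketch, variables with stronger external evidence are proposed as split candidates more often, while the floor keeps weakly supported variables from being excluded altogether.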