Random Partitioning and Distribution-based Thresholding for Iterative Variable Screening in High Dimensions

Su-Yun Huang; Tzee-Ming Huang; Yu-Hsiang Cheng


In big data analysis, even a simple task such as linear regression can become very challenging as the variable dimension $p$ grows. As a result, variable screening is unavoidable in many scientific studies. In recent years, randomized algorithms have become a new trend and play an increasingly important role in large-scale data analysis. In this article, we combine the ideas of variable screening and random partitioning to propose a new iterative variable screening method. For moderately sized $p$ of order $O(n^{2-\delta})$, where $\delta > 0$ can be an arbitrarily small positive number, we propose a basic algorithm that adopts a distribution-based thresholding rule. For very large $p$, we further propose a two-stage procedure. The first stage randomly partitions the predictors into subsets of manageable size, of order $O(n^{2-\delta})$, and performs variable screening within each subset; the random partitioning is repeated a few times. The second stage obtains the final estimate of the variable subset by integrating the results from the multiple random partitions. Simulation studies show that our method works well and outperforms several well-known competitors. Real data applications are presented. Our algorithms can handle millions of predictors.
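To make the two-stage idea concrete, here is a minimal Python sketch, not the authors' algorithm. The abstract does not specify the distribution-based thresholding rule, the iteration scheme, or how results are integrated, so everything below is an assumption for illustration: each block is screened by a joint least-squares fit with a normal-quantile cutoff `z` on the t-statistics, blocks are kept smaller than $n$ (rather than of order $O(n^{2-\delta})$) so that the fit is well defined, and partitions are combined by a simple vote count. The function names `screen_block` and `random_partition_screen`, and the parameters `block_size`, `n_partitions`, `min_votes`, and `z`, are hypothetical.

```python
import numpy as np


def screen_block(X, y, idx, z=3.0):
    """Screen one block of predictors (stand-in rule, not the paper's).

    Fits least squares of y on the columns in idx and keeps those whose
    absolute t-statistic exceeds z. Assumes len(idx) < n - 1.
    """
    n, m = X.shape[0], len(idx)
    Xc = X[:, idx] - X[:, idx].mean(axis=0)  # centering absorbs the intercept
    yc = y - y.mean()
    gram_inv = np.linalg.inv(Xc.T @ Xc)
    beta = gram_inv @ (Xc.T @ yc)
    resid = yc - Xc @ beta
    sigma2 = resid @ resid / (n - m - 1)     # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(gram_inv))
    return idx[np.abs(beta / se) > z]


def random_partition_screen(X, y, block_size, n_partitions=5,
                            min_votes=3, z=3.0, seed=0):
    """Repeat random partitioning and keep predictors selected often enough."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    votes = np.zeros(p, dtype=int)
    n_blocks = int(np.ceil(p / block_size))
    for _ in range(n_partitions):
        perm = rng.permutation(p)            # a fresh random partition
        for block in np.array_split(perm, n_blocks):
            votes[screen_block(X, y, block, z)] += 1
    # Integrate across partitions: keep predictors with enough votes (assumed rule).
    return np.flatnonzero(votes >= min_votes)
```

Because each block is fitted jointly, the predictors a variable shares a block with change from one partition to the next, which is why repeating the partition and combining the results can be informative. A call such as `random_partition_screen(X, y, block_size=50)` returns the indices of the retained columns of `X`.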

