Zipf's Law in Importance of Genes for Cancer Classification Using Microarray Data
Microarray data consist of mRNA expression levels of thousands of genes under certain conditions. A difference in a gene's expression level between two conditions or phenotypes, such as cancerous versus non-cancerous tissue, one cancer subtype versus another, or before versus after a drug treatment, indicates that the gene is relevant to the difference in the high-level phenotype. Each gene can be ranked by its ability to distinguish the two conditions. We study how the single-gene classification ability decreases with its rank (a Zipf's plot). A power-law function is observed in the Zipf's plot for the four microarray datasets obtained from various cancer studies. This power-law behavior is reminiscent of similar power-law curves in other natural and social phenomena (Zipf's law). However, because we measure classification ability by the maximized likelihood in a logistic regression, the exponent of the power-law function depends on the sample size, rather than taking a fixed value close to 1 as in a typical example of Zipf's law. The presence of this power-law behavior is important for deciding the number of genes to be used in a discriminant microarray data analysis.
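The ranking-and-fitting procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names (`max_log_likelihood`, `zipf_fit`), the Newton-Raphson fit, the tiny ridge term added for numerical stability, and the synthetic expression matrix are all assumptions introduced here. Each gene gets its own one-predictor logistic regression, genes are ranked by maximized log-likelihood, and a straight line is fitted to log-likelihood versus log-rank, whose negated slope plays the role of the power-law exponent.

```python
import numpy as np

def max_log_likelihood(x, y, n_iter=25, ridge=1e-6):
    """Approximate maximized log-likelihood of logit P(y=1) = a + b*x.

    Fitted by Newton-Raphson; the small ridge term is an assumption added
    only to keep the Hessian invertible, it is not part of the abstract.
    """
    x = (x - x.mean()) / x.std()              # standardize the gene
    X = np.column_stack([np.ones_like(x), x])  # intercept + expression
    beta = np.zeros(2)
    for _ in range(n_iter):
        z = X @ beta
        p = 0.5 * (1.0 + np.tanh(0.5 * z))     # numerically stable sigmoid
        w = p * (1.0 - p)
        grad = X.T @ (y - p) - ridge * beta
        hess = (X * w[:, None]).T @ X + ridge * np.eye(2)
        beta += np.linalg.solve(hess, grad)
    z = X @ beta
    # Unpenalized Bernoulli log-likelihood at the fitted coefficients.
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def zipf_fit(expr, labels):
    """Rank genes by log-likelihood and fit log L_r = c - alpha * log r."""
    loglik = np.array([max_log_likelihood(g, labels) for g in expr])
    order = np.argsort(-loglik)                # best-separating gene first
    ranked = loglik[order]
    ranks = np.arange(1, len(ranked) + 1)
    slope, _ = np.polyfit(np.log(ranks), ranked, 1)
    return order, ranked, -slope               # alpha = -slope

# Synthetic stand-in for a microarray: rows are genes, columns are samples.
rng = np.random.default_rng(0)
n_genes, n_samples = 200, 40
labels = np.repeat([0.0, 1.0], n_samples // 2)       # two phenotypes
effects = rng.exponential(0.2, size=n_genes)         # hypothetical shifts
expr = rng.normal(size=(n_genes, n_samples)) + np.outer(effects, labels)

order, ranked, alpha = zipf_fit(expr, labels)
print(f"top-ranked gene index: {order[0]}, fitted exponent: {alpha:.3f}")
```

The synthetic data serve only to make the sketch runnable and say nothing about the exponents seen in real datasets. Because the log-likelihood is a sum over samples, its magnitude grows with the number of samples, which is consistent with the abstract's remark that the fitted exponent depends on sample size.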