Multiclass sparse logistic regression for classification of multiple cancer types using gene expression data
Introduction
Constructing a classification rule for tissue samples based on gene expression profiles has received much attention recently due to emerging microarray technology. A new challenge is that the number of genes (i.e. the dimension of inputs) is much larger than the number of tissue samples, in which case standard classification methods are either not applicable or perform badly. Also, identifying a small subset of informative genes, called marker genes, which discriminate types of tumors or tumor versus normal tissues, has become an important subject. Hence, good learning algorithms for gene expression data should provide a classification rule which not only yields high accuracy but also identifies marker genes. In the related literature, Guyon et al. (2002) proposed a recursive feature elimination technique with support vector machines, Li et al. (2002) introduced two Bayesian approaches with the technique of automatic relevance determination, and Shevade and Keerthi (2003) and Roth (2002) applied sparse logistic regression, to name just a few.
Among these tools, sparse logistic regression is a useful classification method for gene expression data. It gives a sparse solution with high accuracy, and it provides the user with explicit classification probabilities in addition to the class labels. However, its optimal extension to more than two classes is not obvious. A standard multiclass extension of sparse logistic regression is sparse multinomial logistic (SML) regression (Krishnapuram et al., 2004), which is a sparse version of the multinomial logit model, a popular multiclass formulation in statistics (see, for example, Agresti, 1990). SML, however, has a problem in gene selection: the estimates of the regression coefficients depend on the choice of the baseline class (see Section 2 for the definition), and so do the selected genes. As a result, some important genes can be dropped from the final model, which in turn degrades prediction accuracy. The empirical results in Section 4 confirm this observation.
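The baseline dependence can be illustrated with a small numerical sketch. It is not taken from the paper: the toy data, the coefficient matrix `B` and the helper names `softmax_probs` and `rebase` are assumptions made for illustration. The sketch shows two parametrizations of the same multinomial logit model that differ only in the baseline class. They give identical class probabilities but different L1 norms, so an L1 penalty does not treat them alike.

```python
import numpy as np

def softmax_probs(X, B):
    """Class probabilities for scores X @ B (one column of B per class)."""
    scores = X @ B
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    expd = np.exp(scores)
    return expd / expd.sum(axis=1, keepdims=True)

def rebase(B, baseline):
    """Re-express the coefficients so that column `baseline` is all zeros."""
    return B - B[:, [baseline]]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))      # 5 toy samples, 4 toy "genes" (hypothetical data)
B = np.zeros((4, 3))             # 3 classes; class 0 is the baseline (zero column)
B[0, 1] = 2.0                    # gene 0 separates class 1 from class 0
B[0, 2] = 2.0                    # gene 0 separates class 2 from class 0

B_alt = rebase(B, baseline=1)    # same model, now with class 1 as the baseline

same_fit = np.allclose(softmax_probs(X, B), softmax_probs(X, B_alt))
l1_base0 = np.abs(B).sum()       # L1 norm with class 0 as the baseline
l1_base1 = np.abs(B_alt).sum()   # L1 norm with class 1 as the baseline
print(same_fit, l1_base0, l1_base1)
```

Because the penalty differs while the likelihood does not, the penalized estimate, and with it the set of nonzero coefficients, can change when the baseline class changes.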
In this paper, we propose a new multiclass extension of sparse logistic regression called sparse one-against-all logistic (SOVAL) regression, whose main idea is to reduce a multiclass problem to multiple binary problems and to construct a classifier using the reduced multiple binary problems simultaneously. By analyzing five real data sets of gene expressions, we show that SOVAL outperforms SML in prediction accuracy as well as gene selectivity.
The paper is organized as follows. In Section 2, SOVAL as well as SML are presented. A computational algorithm based on the gradient LASSO algorithm of Kim et al. (2005) is given in Section 3. Results of numerical experiments are presented in Section 4 and concluding remarks follow in Section 5.
Models
Let $(x_1, y_1), \ldots, (x_n, y_n)$ be input–output pairs of a given data set, where $x_i \in \mathbb{R}^p$ is the vector of gene expression levels and $y_i \in \{1, \ldots, J\}$ is the type of cancer of the $i$th tissue sample. Here, $n$ is the number of tissues, $p$ the number of genes and $J$ the number of classes (i.e. tumor types). We first present SML and then propose SOVAL.
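The reduction of a multiclass problem to several binary problems can be sketched as follows. This is a minimal illustration under stated assumptions, not the SOVAL objective itself, whose exact form is not reproduced in this excerpt. It assumes class labels coded 0, ..., J-1, one weight column per class, and a loss formed by summing the J binary logistic losses. The function names are hypothetical.

```python
import numpy as np

def one_vs_all_labels(y, n_classes):
    """Return an (n, J) matrix of +1/-1 labels; column j marks class j vs the rest."""
    Y = -np.ones((y.shape[0], n_classes))
    Y[np.arange(y.shape[0]), y] = 1.0
    return Y

def summed_binary_logistic_loss(X, Y, W, b):
    """Sum over samples and classes of log(1 + exp(-margin))."""
    margins = Y * (X @ W + b)                  # shape (n, J)
    return np.logaddexp(0.0, -margins).sum()   # numerically stable form

def predict(X, W, b):
    """Assign each sample to the class with the largest one-vs-all score."""
    return np.argmax(X @ W + b, axis=1)

# Toy data (hypothetical): n = 4 samples, p = 3 genes, J = 3 classes.
y = np.array([0, 2, 1, 2])
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
Y = one_vs_all_labels(y, n_classes=3)

loss_at_zero = summed_binary_logistic_loss(X, Y, np.zeros((3, 3)), np.zeros(3))
labels_hat = predict(X, np.eye(3), np.zeros(3))
print(loss_at_zero, labels_hat)
```

In this form no class plays the role of a baseline, since every class has its own coefficient column. A sparsity constraint on `W` would be added on top of this loss.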
A computational algorithm
We first present a general version of the gradient LASSO algorithm developed by Kim et al. (2005), and explain how to modify it for SOVAL as well as SML. Let $w \in \mathbb{R}^d$ and let $C(w)$ be a convex function defined on $\mathbb{R}^d$. The objective of the gradient LASSO is to find the minimizer of $C(w)$ over $w \in D$, where $D$ is the subset of $\mathbb{R}^d$ defined by $D = \{w : \|w\|_1 \le \lambda\}$. Let $e_k$ be the vector in $\mathbb{R}^d$ with the $k$th component equal to 1 and the others 0. Fig. 1 gives the gradient LASSO algorithm for this problem.
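Fig. 1 is not reproduced in this excerpt. As a rough illustration of minimizing a convex function over an L1 ball by moving toward signed coordinate vectors, the sketch below uses a generic conditional gradient (Frank-Wolfe) iteration. It is a stand-in chosen for illustration, not the gradient LASSO algorithm of Kim et al. (2005). The function name, the step-size rule and the toy objective are all assumptions.

```python
import numpy as np

def frank_wolfe_l1(grad, dim, radius, n_iter=200):
    """Minimize a smooth convex function over {w : ||w||_1 <= radius}."""
    w = np.zeros(dim)
    for t in range(n_iter):
        g = grad(w)
        k = int(np.argmax(np.abs(g)))          # coordinate with the steepest slope
        vertex = np.zeros(dim)
        vertex[k] = -radius * np.sign(g[k])    # vertex of the L1 ball along e_k
        step = 2.0 / (t + 2.0)                 # standard diminishing step size
        w = (1.0 - step) * w + step * vertex   # convex combination stays feasible
    return w

# Toy objective (hypothetical): 0.5 * ||w - target||^2, with gradient w - target.
target = np.array([3.0, 0.0, 0.0])
w_hat = frank_wolfe_l1(lambda w: w - target, dim=3, radius=1.0)
print(w_hat)
```

Each iteration changes the iterate along a single coordinate direction, so early iterates have few nonzero components. That is one reason methods of this kind suit problems where the number of coefficients is much larger than the number of samples.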
The hardest part of the
Numerical experiments
We compare the two multiclass extensions of sparse logistic regression on five publicly available data sets.
Concluding remarks
In this paper, we proposed a multiclass extension of sparse logistic regression called SOVAL, compared it with SML, and developed an efficient computational algorithm suitable for gene expression data. The numerical experiments showed that SOVAL outperforms SML in several respects: it (i) gives better prediction accuracy, (ii) has higher power for detecting important genes and (iii) does not require the choice of a baseline class.
The main idea of SOVAL is somehow related to the
Acknowledgments
The first author and second author were supported in part by KOSEF through the Statistical Research Center for Complex Systems at Seoul National University. The third author was supported in part by KOSEF (R14-2003-002-01000).
References (18)
Agresti, A., 1990. Categorical Data Analysis.
Alizadeh, A.A., et al., 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature.
Breiman, L., 1995. Better subset regression using the nonnegative garrote. Technometrics.
Dudoit, S., et al., 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc.
Golub, T.R., et al., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science.
Guyon, I., et al., 2002. Gene selection for cancer classification using support vector machines. Mach. Learn.
Jung, S.H., Jang, W., 2006. How accurately can we control the FDR in analyzing microarray data? Bioinformatics, to...
Khan, J., et al., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Med.
Kim, J., Kim, Y., Kim, Y., 2005. A gradient descent algorithm for generalized LASSO. Technical Report, Department of...