Multiclass sparse logistic regression for classification of multiple cancer types using gene expression data

doi:10.1016/j.csda.2006.06.007

Computational Statistics & Data Analysis

Volume 51, Issue 3, 1 December 2006, Pages 1643-1655

Abstract

Monitoring gene expression profiles is a novel approach to cancer diagnosis. Several studies have shown that sparse logistic regression is a useful classification method for gene expression data. Not only does it give a sparse solution with high accuracy, but it also provides the user with explicit probabilities of classification in addition to the class information. However, its optimal extension to more than two classes is not obvious. In this paper, we propose a multiclass extension of sparse logistic regression. Analysis of five publicly available gene expression data sets shows that the proposed method outperforms the standard multinomial logistic model in prediction accuracy as well as gene selectivity.

Introduction

Constructing a classification rule for tissue samples based on gene expression profiles has received much attention recently due to emerging microarray technology. A new challenge is that the number of genes (i.e. the dimension of inputs) is much larger than the number of tissue samples, in which case standard classification methods either are not applicable or perform poorly. Also, identifying a small subset of informative genes, called marker genes, which discriminate types of tumors or tumor versus normal tissues, has become an important subject. Hence, good learning algorithms for gene expression data should provide a classification rule which not only yields high accuracy but also has the ability to identify marker genes. In related literature, Guyon et al. (2002) proposed a recursive feature elimination technique with support vector machines, Li et al. (2002) introduced two Bayesian approaches with the technique of automatic relevance determination, and Shevade and Keerthi (2003) and Roth (2002) applied sparse logistic regression, to name just a few.

Among these tools, sparse logistic regression is a useful classification method for gene expression data. It gives a sparse solution with high accuracy, and it also provides the user with explicit probabilities of classification in addition to the class information. However, its optimal extension to more than two classes is not obvious. A standard multiclass extension of sparse logistic regression would be sparse multinomial logistic (SML) regression (Krishnapuram et al., 2004), a sparse version of the multinomial logit model, which is a popular multiclass formulation in statistics (see, for example, Agresti, 1990). SML, however, has a problem in gene selection: the estimates of the regression coefficients depend on the choice of the baseline class (see Section 2 for definition), and so do the selected genes. Hence, some important genes can be dropped from the final model, which in turn degrades prediction accuracy. The empirical results in Section 4 confirm this observation.

In this paper, we propose a new multiclass extension of sparse logistic regression called sparse one-against-all logistic (SOVAL) regression, whose main idea is to reduce a multiclass problem to multiple binary problems and to construct a classifier using the reduced multiple binary problems simultaneously. By analyzing five real data sets of gene expressions, we show that SOVAL outperforms SML in prediction accuracy as well as gene selectivity.
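The reduction of a multiclass problem to multiple binary problems can be illustrated with a minimal sketch. It shows only the label encoding, not how SOVAL fits the binary problems simultaneously, which this excerpt does not specify. The function name `one_against_all_labels` and the toy labels are assumptions made for illustration.

```python
def one_against_all_labels(y, num_classes):
    """Turn class labels in {1, ..., J} into J binary label vectors.

    Returns a list of J lists; entry [j - 1][i] is 1 if sample i belongs
    to class j and 0 otherwise.
    """
    return [[1 if label == j else 0 for label in y]
            for j in range(1, num_classes + 1)]


# Toy labels for six tissue samples and J = 3 tumor types (made-up values).
y = [1, 3, 2, 1, 3, 3]
binary_problems = one_against_all_labels(y, num_classes=3)
for j, targets in enumerate(binary_problems, start=1):
    print(f"class {j} vs rest: {targets}")
```

Each of the J binary vectors defines one "class j versus the rest" problem, and every sample is a positive example in exactly one of them.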

The paper is organized as follows. In Section 2, SOVAL as well as SML are presented. A computational algorithm based on the gradient LASSO algorithm of Kim et al. (2005) is given in Section 3. Results of numerical experiments are presented in Section 4 and concluding remarks follow in Section 5.

Section snippets

Models

Let $\{(x_{1}, y_{1}), \dots, (x_{n}, y_{n})\}$ be input–output pairs of a given data set where $x_{i} \in R^{p}$ is the vector of gene expression levels and $y_{i} \in \{1, 2, \dots, J\}$ is the type of cancer of the ith tissue sample. Here, n is the number of tissues, p the number of genes and J the number of classes (i.e. tumor types). We first present SML and then propose SOVAL.
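For orientation, the multinomial logit model underlying SML is usually written in the following textbook form (see, for example, Agresti, 1990), with class J taken as the baseline. The coefficient vectors $\beta_{j} \in R^{p}$ are notation assumed here, and intercepts are omitted for brevity; the paper's own formulation in the full text may differ in detail.

```latex
\Pr(y_{i} = j \mid x_{i})
  = \frac{\exp(x_{i}^{\top} \beta_{j})}{1 + \sum_{l=1}^{J-1} \exp(x_{i}^{\top} \beta_{l})},
  \quad j = 1, \dots, J-1,
\qquad
\Pr(y_{i} = J \mid x_{i})
  = \frac{1}{1 + \sum_{l=1}^{J-1} \exp(x_{i}^{\top} \beta_{l})}.
```

Switching the baseline to another class $j'$ replaces each $\beta_{j}$ by $\beta_{j} - \beta_{j'}$. The fitted probabilities are unchanged, but an $\ell_{1}$-type penalty on the coefficients generally is not, which is one way to see why the genes selected by a sparse fit can depend on the baseline.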

A computational algorithm

We first present a general version of the gradient LASSO algorithm developed by Kim et al. (2005), and explain how to modify it for SOVAL as well as SML. Let $z \in R^{q}$ and let $L(z)$ be a convex function defined on $R^{q}$. The objective of the gradient LASSO is to find the minimizer of $L(z)$ over $z \in D$, where D is the subset of $R^{q}$ defined by $D = \{z \in R^{q} : \sum_{k = 1}^{q} |z_{k}| \leq 1\}$. Let $e_{k}$ be the vector in $R^{q}$ with the kth component equal to 1 and the others 0. Fig. 1 gives the gradient LASSO algorithm for this problem.
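Since Fig. 1 is not reproduced in this excerpt, the following is only a rough sketch of the kind of problem being solved: minimizing a convex $L(z)$ over the set D by moving toward vertices $\pm e_{k}$. It uses a generic Frank-Wolfe (conditional gradient) scheme with a standard diminishing step size, which is a stand-in technique and not the algorithm of Kim et al. (2005); that algorithm may differ, for example in how the step is chosen. The function names and the toy objective are assumptions.

```python
def l1_ball_step(grad, z, step):
    """One Frank-Wolfe-style move inside D = {z : sum_k |z_k| <= 1}."""
    # Pick the coordinate whose partial derivative is largest in magnitude.
    k = max(range(len(z)), key=lambda i: abs(grad[i]))
    # The vertex of D that most decreases the linearised objective is -sign(g_k) * e_k.
    sign = -1.0 if grad[k] > 0 else 1.0
    # Convex combination of the current point and that vertex, so we stay in D.
    new_z = [(1.0 - step) * zi for zi in z]
    new_z[k] += step * sign
    return new_z


def minimise_on_l1_ball(grad_fn, q, iterations=200):
    """Run the vertex-directed updates from z = 0 (hypothetical helper)."""
    z = [0.0] * q
    for t in range(iterations):
        step = 2.0 / (t + 2.0)  # generic diminishing step, not from the paper
        z = l1_ball_step(grad_fn(z), z, step)
    return z


# Toy convex objective L(z) = 0.5 * ||z - target||^2 with a made-up target outside D.
target = [2.0, 0.0, 0.0]
grad_fn = lambda z: [zi - ti for zi, ti in zip(z, target)]
z_hat = minimise_on_l1_ball(grad_fn, q=3)
print(z_hat, sum(abs(v) for v in z_hat))
```

Because every update is a convex combination of the current point and a vertex of D, the iterate never leaves D, and each step changes the weight on a single coordinate direction, which is what keeps solutions sparse when few steps are taken.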

The hardest part of the

Numerical experiments

We compare the two multiclass extensions of sparse logistic regression on five publicly available data sets.

Concluding remarks

In this paper, we proposed a multiclass extension of sparse logistic regression, called SOVAL, compared it with SML, and developed an efficient computational algorithm suitable for gene expression data. The numerical experiments showed that SOVAL outperforms SML in several respects: it (i) gives better prediction accuracy; (ii) has higher power to detect important genes; and (iii) does not require the choice of a baseline class.

The main idea of SOVAL is somehow related to the

Acknowledgments

The first and second authors were supported in part by KOSEF through the Statistical Research Center for Complex Systems at Seoul National University. The third author was supported in part by KOSEF (R14-2003-002-01000).

References (18)

Agresti, A., 1990. Categorical Data Analysis.
Alizadeh, A., et al., 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature.
Breiman, L., 1995. Better subset regression using the nonnegative garrote. Technometrics.
Dudoit, S., et al., 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc.
Golub, T., et al., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science.
Guyon, I., et al., 2002. Gene selection for cancer classification using support vector machines. Mach. Learn.
Jung, S.H., Jang, W., 2006. How accurately can we control the FDR in analyzing microarray data? Bioinformatics, to...
Khan, J., et al., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Med.
Kim, J., Kim, Y., Kim, Y., 2005. A gradient descent algorithm for generalized LASSO. Technical Report, Department of...

There are more references available in the full text version of this article.
