Copyright © 2007 Published by Elsevier B.V.
High-dimensional pseudo-logistic regression and classification with applications to gene expression data
Available online 28 December 2006.
Abstract
High-dimension, low-sample-size data, such as microarray gene expression levels, pose numerous challenges to conventional statistical methods. In the particular case of binary classification, some methods, such as the support vector machine (SVM), deal efficiently with high-dimensional predictors but lack accuracy in estimating the probability of class membership. In contrast, traditional logistic regression (TLR) effectively estimates the probability of class membership for data with low-dimensional inputs but does not handle high-dimensional cases. The study bridges the gap between SVM and TLR through their loss functions. Based on a proposed new loss function, a pseudo-logistic regression and classification approach that combines the strengths of both SVM and TLR is also proposed. Simulation evaluations and real data applications demonstrate that, for low-dimensional data, the proposed method produces regression estimates comparable to those of TLR and penalized logistic regression, and that, for high-dimensional data, the new method achieves higher classification accuracy than SVM while also offering improved computational convergence and stability.
Keywords: Bayes optimal rule; Large p and small n data; Logistic regression; Loss function; Support vector machine
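The abstract contrasts SVM and TLR through their loss functions. As a rough illustration only, the sketch below evaluates the standard hinge loss (used by SVM) and the standard logistic loss (used by TLR) as functions of the margin. The label convention y in {-1, +1}, the margin m = y * f(x), and the function names are assumptions made for this example; the paper's own pseudo-logistic loss is not given in the abstract and is not reproduced here.

```python
import math


def hinge_loss(margin):
    """Standard SVM hinge loss: max(0, 1 - m), with m = y * f(x), y in {-1, +1} (assumed convention)."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Standard logistic loss: log(1 + exp(-m)), computed in a numerically stable way."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite log(1 + exp(-m)) as -m + log(1 + exp(m)) to avoid overflow.
    return -margin + math.log1p(math.exp(margin))


def logistic_probability(score):
    """Class-membership probability P(y = +1 | x) implied by a logistic model with score f(x)."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Tabulate both losses over a few margins to compare their shapes.
    for m in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"margin={m:+.1f}  hinge={hinge_loss(m):.4f}  logistic={logistic_loss(m):.4f}")
```

The hinge loss is exactly zero once the margin reaches 1, which is why an SVM score carries no direct probability interpretation, whereas the logistic loss stays strictly positive and maps to a probability through `logistic_probability`.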
Article Outline
- 1. Introduction
- 2. Logistic regression and SVM
- 3. Pseudo-logistic regression
- 3.1. Pseudo-quadratic logistic regression
- 3.2. Comparison between PsLR and other algorithms
- 3.3. Why scaling in PsLR? Difference between regression and classification
- 4. Bias correction for PsLR estimates
- 5. Simulation study
- 6. Real data analysis
- 6.1. Small p large n data
- 6.1.1. Minnesota Storm data
- 6.1.2. Pima Indians diabetes data
- 6.1.3. Wisconsin breast cancer data
- 6.2. Large p small n data
- 6.2.1. Leukemia data
- 6.2.2. Breast cancer gene expression data
- 6.2.3. Colon data
- 7. Discussion
- Acknowledgements
- Appendix A. Equivalence between (3.3) and (3.4)–(3.5)
- Appendix B. Equivalence between (3.4)–(3.5) and (3.6)–(3.7)
- Appendix C. Derivation of the algorithm for PsLR
- Appendix D. Proof of Theorem 1
- References













