Block-diagonal discriminant analysis and its bias-corrected rules

Herbert Pang; Tiejun Tong; Michael Ng

doi:10.1515/sagmb-2012-0017

Published by De Gruyter May 11, 2013


From the journal Statistical Applications in Genetics and Molecular Biology

https://doi.org/10.1515/sagmb-2012-0017


Abstract

High-throughput expression profiling allows the simultaneous measurement of tens of thousands of genes. These data have motivated the development of reliable biomarkers for disease subtype identification and diagnosis. Many methods have been developed in the literature for analyzing these data, such as diagonal discriminant analysis, support vector machines, and k-nearest neighbor methods. The diagonal discriminant methods have been shown to perform well for high-dimensional data with small sample sizes. Despite their popularity, the independence assumption underlying these methods is unlikely to hold in practice. Recently, a gene module based linear discriminant analysis strategy has been proposed that utilizes the correlation among genes in discriminant analysis. However, the approach can be underpowered when the sample sizes of the two classes are unbalanced. In this paper, we propose to correct the biases in the discriminant scores of block-diagonal discriminant analysis. In simulation studies, our proposed method outperforms other approaches in various settings. We also illustrate our proposed discriminant analysis method by analyzing data from microarray studies.
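To make the block-diagonal idea concrete, the following is a minimal sketch of an uncorrected, plug-in block-diagonal linear discriminant rule. It is not the bias-corrected rule proposed in the paper, whose correction terms are not given in this preview. The function names `fit_block_lda` and `predict_block_lda`, the representation of blocks as lists of feature indices, and the use of a pooled within-class covariance per block are assumptions made for illustration. Each block must contain fewer features than the pooled degrees of freedom so that its covariance estimate is invertible.

```python
import numpy as np

def fit_block_lda(X, y, blocks):
    """Estimate class means, priors, and one inverse pooled covariance per block.

    blocks is a list of lists of feature indices (an assumed input; how the
    blocks are found, e.g. by clustering genes, is outside this sketch).
    """
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    priors = {k: np.mean(y == k) for k in classes}
    # Within-class residuals, stacked over classes, give the pooled covariance.
    resid = np.vstack([X[y == k] - means[k] for k in classes])
    dof = X.shape[0] - len(classes)
    inv_covs = []
    for idx in blocks:
        S = resid[:, idx].T @ resid[:, idx] / dof
        inv_covs.append(np.linalg.inv(S))
    return classes, means, priors, inv_covs

def predict_block_lda(model, X, blocks):
    """Assign each row of X to the class with the smallest discriminant score."""
    classes, means, priors, inv_covs = model
    scores = np.zeros((X.shape[0], len(classes)))
    for j, k in enumerate(classes):
        d = X - means[k]
        for idx, P in zip(blocks, inv_covs):
            db = d[:, idx]
            # Squared Mahalanobis distance within this block, one value per row.
            scores[:, j] += np.einsum("ij,jk,ik->i", db, P, db)
        # Prior term: rarer classes receive a larger penalty.
        scores[:, j] -= 2.0 * np.log(priors[k])
    return classes[np.argmin(scores, axis=1)]
```

Setting every block to a single feature reduces this sketch to a diagonal discriminant rule, while a single block containing all features gives ordinary linear discriminant analysis.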

Keywords: bias-correction; block-diagonal; classification; high-dimensional data; linear discriminant analysis

Corresponding author: Tiejun Tong, Department of Mathematics, Hong Kong Baptist University, Hong Kong

Herbert Pang’s research was supported in part by the National Institutes of Health under Award P01CA142538 and funds from DUMC. Tiejun Tong’s research was supported in part by the Hong Kong Research Grants Council under Grant 202711 and HKBU FRGs. Michael Ng’s research was supported in part by the Hong Kong Research Grants Council under Grant 201508 and HKBU FRGs. The authors are grateful to the editor, the associate editor, and two reviewers for their constructive comments and suggestions, which have led to a substantial improvement in the article.


Published Online: 2013-05-11

Published in Print: 2013-06-01
