Abstract
High-throughput expression profiling allows simultaneous measurement of tens of thousands of genes. These data have motivated the development of reliable biomarkers for disease subtype identification and diagnosis. Many methods have been developed for analyzing these data, such as diagonal discriminant analysis, support vector machines, and k-nearest neighbor methods. Diagonal discriminant methods have been shown to perform well for high-dimensional data with small sample sizes. Despite their popularity, the independence assumption they rely on is unlikely to hold in practice. Recently, a gene module-based linear discriminant analysis strategy has been proposed that utilizes the correlation among genes in discriminant analysis. However, this approach can be underpowered when the sample sizes of the two classes are unbalanced. In this paper, we propose to correct the biases in the discriminant scores of block-diagonal discriminant analysis. In simulation studies, our proposed method outperforms other approaches in various settings. We also illustrate our proposed discriminant analysis method by applying it to microarray data studies.
Herbert Pang’s research was supported in part by the National Institutes of Health under Award P01CA142538 and funds from DUMC. Tiejun Tong’s research was supported in part by the Hong Kong Research Grants Council under Grant 202711 and HKBU FRGs. Michael Ng’s research was supported in part by the Hong Kong Research Grants Council under Grant 201508 and HKBU FRGs. The authors are grateful to the editor, the associate editor, and two reviewers for their constructive comments and suggestions, which have led to a substantial improvement in the article.
©2013 by Walter de Gruyter Berlin Boston