Abstract
High-throughput array-based assays have been widely used for diagnostics and biomarker discovery. However, developing these assays requires analyzing high-dimensional genomic data and clinically validating the resulting models, and both tasks remain challenging. In this chapter, we describe the steps of array-based data analysis, from data quality control and normalization to higher-level analyses such as clustering, dimensionality reduction, and predictive modeling, with special emphasis on the pitfalls of such analyses. We then address the problems of clinically validating array-based biomarkers and predictive assays, including the reproducibility and portability of the initial discovery and its translation into the clinic. As array-based data, and data generated by next-generation sequencing technologies, become less expensive to produce and more widely available, the growing number of patients for whom genomic data exist will open new opportunities for developing robust and reliable genomic biomarkers, provided we apply the lessons learned from the last decade of array-based studies.
Notes
- 3. Although it may be helpful to think of the hazard as the instantaneous probability of an event at time t, this quantity is not a probability and may be greater than 1. The reason is that the hazard is the conditional probability of an event in a short interval of length Δt, divided by Δt; it is therefore a rate. The hazard has no upper bound, but it cannot be smaller than 0.
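The point of this note can be checked numerically. The sketch below is only an illustration under stated assumptions: it assumes an exponentially distributed event time with a rate of 3 events per unit time (a model and a value chosen here, not taken from the chapter), and the function names are hypothetical. For this model, the probability of an event in a short interval stays below 1, while the same probability divided by Δt approaches the rate, which exceeds 1.

```python
import math


def conditional_event_probability(rate, dt):
    """P(t <= T < t + dt | T >= t) for an exponential event time.

    Assumption: T is exponential with the given rate, so this probability
    does not depend on t and equals 1 - exp(-rate * dt).
    """
    return -math.expm1(-rate * dt)


def approximate_hazard(rate, dt):
    """Finite-dt approximation of the hazard: the probability above divided by dt."""
    return conditional_event_probability(rate, dt) / dt


rate = 3.0  # illustrative value, not from the chapter
for dt in (0.1, 0.01, 0.001):
    p = conditional_event_probability(rate, dt)
    h = approximate_hazard(rate, dt)
    print(f"dt={dt}: probability={p:.5f}, approximate hazard={h:.4f}")
```

As Δt shrinks, the probability goes to 0 while the ratio converges to the rate, so the hazard is better read as events per unit time than as a probability.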
References
Affymetrix (2004) GeneChip expression analysis: data analysis fundamentals, vol 2447, pp 1–42. doi:10.1002/jnr.10268
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403(6769):503–511
Allison PD, Inc. SI (eds) (1995) Survival analysis using SAS: a practical guide. SAS Institute Inc., Cary, NC
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 96(12):6745–6750
Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97(18):10101–10106. doi:97/18/10101 [pii]
Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 99(10):6562–6566. doi:10.1073/pnas.102102699
Bach FR, Jordan MI (2003) Kernel independent component analysis. J Mach Learn Res 3:1–48
Bair E, Tibshirani R (2004) Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol 2(4):511–522
Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R (2005) NCBI GEO: mining millions of expression profiles—database and tool. Nucleic Acids Res 33:D562
Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med 8(8):816–824
Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data. Proc Pac Symp Biocomput 7:6–17
Benito M, Parker J, Du Q, Wu J, Xiang D, Perou CM, Marron JS (2004) Adjustment of systematic microarray data biases. Bioinformatics 20(1):105–114
Berrer DP, Dubitzky W, Granzow M (2002) A practical approach to microarray data analysis, 1st edn. Springer, New York
Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 98(24):13790–13795
Bishop CM, Jordan M, Kleinberg J, Scholkopf B (eds) (2006) Pattern recognition and machine learning information science and statistics. Springer, New York
Bloom G, Yang IV, Boulware D, Kwong KY, Coppola D, Eschrich S, Quackenbush J, Yeatman TJ (2004) Multi-platform, multi-site, microarray-based human tumor classification. Am J Pathol 164(1):9–16
Bolstad BM (2004) Low-level analysis of high-density oligonucleotide array data: background normalization and summarization. University of California, Berkeley
Bolstad BM, Irizarry RA, Astrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2):185–193
Boulesteix AL, Porzelius C, Daumer M (2008) Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value. Bioinformatics 24(15):1698–1706. doi:btn262 [pii]10.1093/bioinformatics/btn262
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman and Hall, New York
Bylesjo M, Eriksson D, Sjodin A, Jansson S, Moritz T, Trygg J (2007) Orthogonal projections to latent structures as a strategy for microarray data normalization. BMC Bioinformatics 8:207. doi:1471-2105-8-207 [pii]10.1186/1471-2105-8-207
Cardoso F, Piccart-Gebhart M, Van’t Veer L, Rutgers E (2007) The MINDACT trial: the first prospective clinical validation of a genomic tool. Mol Oncol 1(3):246–251. doi:S1574-7891(07)00077-4 [pii]10.1016/j.molonc.2007.10.004
Caruana R, Niculescu-Mizil A (2004) Data mining in metric space: an empirical analysis of supervised learning performance criteria. Paper presented at the ACM SIGKDD international conference on Knowledge discovery and data mining, New York
Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, Liu C (2011) Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One 6(2):e17238. doi:10.1371/journal.pone.0017238
Cheng Y, Church GM (2000) Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 8:93–103
Cobleigh MA, Tabesh B, Bitterman P, Baker J, Cronin M, Liu ML, Borchik R, Mosquera JM, Walker MG, Shak S (2005) Tumor gene expression and prognosis in breast cancer patients with 10 or more positive lymph nodes. Clin Cancer Res 11(24 Pt 1):8623–8631. doi:11/24/8623 [pii]10.1158/1078-0432.CCR-05-0735
Collobert R, Bengio S (2001) SVMTorch: support vector machines for large-scale regression problems. J Mach Learn Res 1:143–160
Contopoulos-Ioannidis DG, Alexiou GA, Gouvias TC, Ioannidis JP (2008) Medicine. Life cycle of translational research for medical interventions. Science 321(5894):1298–1299. doi:321/5894/1298 [pii]10.1126/science.1160622
Cox DR (1972) Regression models and life tables. J R Stat Soc Ser B 34:187–220
Cristianini N, Press CCU, Shawe-Taylor J (eds) (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
Dasarathy BV (ed) (1990) Nearest neighbor: pattern classification techniques. IEEE Computer Society Press, New York
Davis CA, Gerick F, Hintermair V, Friedel CC, Fundel K, Kuffner R, Zimmer R (2006) Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics 22(19):2356–2363. doi:10.1093/bioinformatics/btl400
De Smet F, Mathys J, Marchal K, Thijs G, De Moor B, Moreau Y (2002) Adaptive quality-based clustering of gene expression profiles. Bioinformatics 18(5):735–746
de Souto M, Costa I, de Araujo D, Ludermir T, Schliep A (2008) Clustering cancer gene expression data: a comparative study. BMC Bioinformatics 9(1):497. doi:10.1186/1471-2105-9-497
DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM (1996) Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet 14(4):457–460
Desmedt C, Piette F, Loi SM, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d’Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JGM, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C, Consortium T (2007) Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 13(11):3207–3214
Desmedt C, Haibe-Kains B, Wirapati P, Buyse M, Larsimont D, Bontempi G, Delorenzi M, Piccart M, Sotiriou C (2008) Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes. Clin Cancer Res 14(16):5158–5165. doi:10.1158/1078-0432.CCR-07-4756
Duda RO, Hart PR, Stork DG (2001) Pattern classification. Wiley, New York
Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97(457):77–87
Dupuy A, Simon RM (2007) Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst 99(2):147–157. doi:10.1093/jnci/djk018
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Eisen M, Spellman P, Brown P, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. PNAS 95:14863–14868
Eng-Wong J, Zujewski JA (2008) Current NCI-sponsored cooperative group trials of endocrine therapies in breast cancer. Cancer 112(3 Suppl):723–729. doi:10.1002/cncr.23188
Finak G, Bertos N, Pepin F, Sadekova S, Souleimanova M, Zhao H, Chen H, Omeroglu G, Meterissian S, Omeroglu A, Hallett M, Park M (2008) Stromal gene expression predicts clinical outcome in breast cancer. Nat Med 14(5):518–527. doi:10.1038/nm1764
Fisher RA (2011) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631. doi:10.1198/016214502760047131
Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10):906–914
Gamberger D, Lavrac N (2004) Avoiding data overfitting in scientific discovery: experiments in functional genomics. Paper presented at the ECAI, 22–27 Aug 2004, Valencia, Spain
Gentleman R (2005) Reproducible research: a bioinformatics case study. Stat Appl Genet Mol Biol 4(1)
Gentleman R, Huber W, Carey VJ, Irizarry RA, Dudoit S (2005) Bioinformatics and computational biology solutions using R and bioconductor. Springer, New York
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537
Gui J, Li H (2005) Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 21(13):3001–3008. doi:10.1093/bioinformatics/bti422
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Habel LA, Shak S, Jacobs MK, Capra A, Alexander C, Pho M, Baker J, Walker M, Watson D, Hackett J, Blick NT, Greenberg D, Fehrenbacher L, Langholz B, Quesenberry CP (2006) A population-based study of tumor gene expression and risk of breast cancer death among lymph node-negative patients. Breast Cancer Res 8(3):R25. doi:bcr1412 [pii]10.1186/bcr1412
Haibe-Kains B, Desmedt C, Sotiriou C, Bontempi G (2008) A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics 24(19):2200–2208. doi:10.1093/bioinformatics/btn374
Haibe-Kains B, Desmedt C, Loi SM, Culhane AC, Bontempi G, Quackenbush J, Sotiriou C (2012) A three-gene model to robustly identify breast cancer molecular subtypes. J Natl Cancer Inst 104(4):311–325. doi:10.1093/jnci/djr545
Harr B, Schlotterer C (2006) Comparison of algorithms for the analysis of Affymetrix microarray data as evaluated by co-expression of genes in known operons. Nucleic Acids Res 34(2):8
Harrell FJ, Lee K, Mark D (1996) Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 15(4):361–387. doi:10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4
Hastie T, Tibshirani R (1990) Generalized additive models. Chapman and Hall, London
Hastie T, Bickel P, Tibshirani R, Diggle P, Friedman J, Fienberg S, Gather U, Otkin I, Zeger S (eds) (2001) The elements of statistical learning statistics. Springer, New York
Heagerty PJ, Lumley T, Pepe MS (2000) Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 56:337–344
Hu H, Li J-Y, Wang H, Daggard G, Wang L-Z (2008) Robustness analysis of diversified ensemble decision tree algorithms for Microarray data classification. Paper presented at the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), Kunming, 12–15 Jul 2008
Huber W, von Heydebreck A, Sultman H, Poustka A, Vingron M (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18(1):S96–S104
Irizarry RA, Boldstad BM, Collin F, Cope LM, Hobbs B, Speed TR (2003a) Summaries of affymetrix GeneChip probe level data. Nucleic Acids Res 31(4)
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP (2003b) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (Oxford, England) 4(2):249–264. doi:10.1093/biostatistics/4.2.249
Jin R, Si L, Chan C (2008) A Bayesian framework for knowledge driven regression model in micro-array data analysis. Int J Data Min Bioinform 2(3):250–267
Johnson WE, Li C, Rabinovic A (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8(1):118–127. doi:10.1093/biostatistics/kxj037
Jolliffe IT, Jolliffe IT (eds) (2002) Principal component analysis. Springer series in statistics. Springer, New York
Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53:451–457
Kelemen A, Zhou H, Lawhead P, Liang Y (2003) Naive Bayesian classifier for microarray data. In: 2003 International joint conference on neural networks, vol 3, pp 1769–1773. Paper presented at the 2003 international joint conference on neural networks, IEEE. doi:10.1109/IJCNN.2003.1223675
Kelley RK, Wang G, Venook AP (2011) Biomarker use in colorectal cancer therapy. J Natl Compr Canc Netw 9(11):1293–1302. doi:9/11/1293 [pii]
Khan J, Simon R, Bittner M, Chen Y, Leighton SB, Pohida T, Smith PD, Jiang Y, Gooden GC, Trent JM, Meltzer PS (1998) Gene expression profiling of alveolar rhabdomyosarcoma with cDNA microarrays. Cancer Res 58(22):5009–5013
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324
Kohli-Laven N, Bourret P, Keating P, Cambrosio A (2011) Cancer clinical trials in the era of genomic signatures: biomedical innovation, clinical utility, and regulatory-scientific hybrids. Soc Stud Sci 41(4):487–513
Lee JW, Lee JB, Park M, Song SH (2005) An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 48(4):869–885. doi: 10.1016/j.csda.2004.03.017
Lehmann EL, Caselia G (1998) Theory of point estimation, 2nd edn. Springer, New York
Leisch F (2002) Sweave. Dynamic generation of statistical reports using literate data analysis. In: Computational statistics, vol 69, pp 575–580. Presented at the computational statistics, SFB adaptive information systems and modelling in economics and management science, WU Vienna University of Economics and Business. http://www.google.ca/url?sa=t&rct=j&q=sweave.%20dynamic%20generation%20of%20statistical%20reports%20using%20literate%20data%20analysis&source=web&cd=1&ved=0CDQQFjAA&url=http%3A%2F%2Fepub.wu.ac.at%2F1788%2F1%2Fdocument.pdf&ei=qiVNT7TPLevTiALGwp2wDw&usg=AFQjCNGZ5hg-vOqrB2j6hU7HGhQkhiBrRg&sig2=dmMu57Xag5ci-fANUqxnAA
Li C, Wong WH (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol 2(8):1–11
Lipshutz RJ, Morris D, Chee M, Hubbell E, Kozal MJ, Shah N, Shen N, Yang R, Fodor SP (1995) Using oligonucleotide probe arrays to access genetic diversity. Biotechniques 19(3):442–447
Loi SM, Haibe-Kains B, Desmedt C, Wirapati P, Lallemand F, Tutt AM, Gillet C et al (2008) Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC Genomics 9:239. doi:10.1186/1471-2164-9-239
Loi SM, Haibe-Kains B, Desmedt C, Wirapati P, Lallemand F, Tutt AM, Gillet C, Ellis P, Ryder K, Reid JF, Daidone MG, Pierotti MA, Berns EM, Jansen MP, Foekens JA, Delorenzi M, Bontempi G, Piccart MJ, Sotiriou C (2008) Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC Genomics 9:239. doi:10.1186/1471-2164-9-239
Loi SM, Haibe-Kains B, Majjaj S, Lallemand F, Durbecq V, Larsimont D, Gonzalez-Angulo AM, Pusztai L, Symmans WF, Bardelli A, Ellis P, Tutt ANJ, Gillett CE, Hennessy BT, Mills GB, Phillips WA, Piccart MJ, Speed TP, McArthur GA, Sotiriou C (2010) PIK3CA mutations associated with gene signature of low mTORC1 signaling and better outcomes in estrogen receptor-positive breast cancer. Proc Natl Acad Sci USA 107(22):10208–10213. doi:10.1073/pnas.0907011107
Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J (2010) A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J 10(4):278–291. doi:tpj201057 [pii]10.1038/tpj.2010.57
Mamounas E, Budd GT, Miller K (2008) Incorporating the oncotype DX breast cancer assay into community practice: an expert Q and A and case study sampling. Clin Adv Hematol Oncol 6(2):s1–s8
Manilich EA, Ozsoyoglu ZM, Trubachev V, Radivoyevitch T (2011) Classification of large microarray datasets using fast random forest construction. J Bioinform Comput Biol 9(2):251–267. doi: [pii]S021972001100546X
Marchionni L, Wilson RF, Marinopoulos SS, Wolff AC, Parmigiani G, Bass EB, Goodman SN (2007) Impact of gene expression profiling tests on breast cancer outcomes. Evid Rep Technol Assess (Full Rep) 160:1–105
Mason SJ, Graham NE (2002) Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: statistical significance and interpretation. Q J R Meteorol Soc 128(584):2145–2166. doi:10.1256/003590002320603584
Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405(2):442–451
McCall MN, Bolstad BM, Irizarry RA (2010) Frozen robust multiarray analysis (fRMA). Biostatistics (Oxford, England) 11(2):242–253. doi:10.1093/biostatistics/kxp059
McCall MN, Murakami PN, Lukk M, Huber W, Irizarry RA (2011a) Assessing affymetrix GeneChip microarray quality. BMC Bioinformatics 12:137. doi:1471-2105-12-137 [pii]10.1186/1471-2105-12-137
McCall MN, Uppal K, Jaffee HA, Zilliox MJ, Irizarry RA (2011b) The gene expression barcode: leveraging public data repositories to begin cataloging the human and murine transcriptomes. Nucleic Acids Res 39(Database issue):D1011–D1015. doi:gkq1259 [pii]
Mesirov JP (2010) Computer science accessible reproducible research. Science 327(5964):415–416. doi:327/5964/415 [pii]10.1126/science.1179653
Michiels S, Koscielny S, Hill C (2005) Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 365:488–492
Moch H, Schraml P, Bubendorf L, Mirlacher M, Kononen J, Gasser T, Mihatsch MJ, Kallioniemi OP, Sauter G (1999) Identification of prognostic parameters for renal cell carcinoma by cDNA arrays and cell chips. Verh Dtsch Ges Pathol 83:225–232
Mook S, van’t Veer LJ, Rutgers EJ, Piccart-Gebhart MJ, Cardoso F (2007) Individualization of therapy using mammaprint: from development to the MINDACT Trial. Cancer Genomics Proteomics 4(3):147–155
Natsoulis G, El Ghaoui L, Lanckriet GRG, Tolley AM, Leroy F, Dunlea S, Eynon BP, Pearson CI, Tugendreich S, Jarnagin K (2005) Classification of a large microarray data set: algorithm comparison and analysis of drug signatures. Genome Res 15(5):724–736. doi:10.1101/gr.2807605
Nepomuceno-Chamorro I, Azuaje F, Devaux Y, Nazarov PV, Muller A, Aguilar-Ruiz JS, Wagner DR (2011) Prognostic transcriptional association networks: a new supervised approach based on regression trees. Bioinformatics 27(2):252–258. doi:btq645 [pii]10.1093/bioinformatics/btq645
Onitilo AA, Engel JM, Greenlee RT, Mukesh BN (2009) Breast cancer subtypes based on ER/PR and Her2 expression: comparison of clinicopathologic features and survival. Clin Med Res 7(1–2):4–13. doi:10.3121/cmr.2009.825
Osorio YFJ, Prina E, Lang T, Milon G, Davory C, Coppee JY, Regnault B (2008) AffyGCQC: a web-based interface to detect outlying genechips with extreme studentized deviate tests. J Bioinform Comput Biol 6(2):317–334. doi:S0219720008003400 [pii]
Paik S, Shak S, Tang G, Kim C, Bakker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, Hiller W, Fisher ER, Wickerham DL, Bryant J, Wolmark N (2004) A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 351(27):2817–2826
Pang S, Havukkala I, Hu Y, Kasabov N (2007) Classification consistency analysis for bootstrapping gene selection. Neural Comput Appl 18(6):527–539
Park MY, Hastie T (2007) L1 regularization path algorithm for generalized linear models. J R Stat Soc 69:659–677
Parkinson H, Sarkans U, Shojatalab M, Contrino S, Coulson R, Farne A, Lara GG, Holloway E, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma A, Sansone S, Brazma A (2005) ArrayExpress: a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 33:D553–D555
Parry RM, Jones W, Stokes TH, Phan JH, Moffitt RA, Fang H, Shi L, Oberthuer A, Fischer M, Tong W, Wang MD (2010) k-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction. Pharmacogenomics J 10(4):292–309. doi:10.1038/tpj.2010.56
Perou CM, Jeffrey SS, van de Rijn M, Rees CA, Eisen MB, Ross DT, Pergamenschikov A, Williams CF, Zhu SX, Lee JC, Lashkari D, Shalon D, Brown PO, Botstein D (1999) Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci USA 96(16):9212–9217
Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, Borresen-Dale A-L, Brown PO, Botstein D (2000) Molecular portraits of human breast tumours. Nature 406(6797):747–752. doi:10.1038/35021093
Phuong TM, Lee D, Lee KH (2004) Regression trees for regulatory element identification. Bioinformatics 20(5):750–757. doi:10.1093/bioinformatics/btg480 btg480 [pii]
Ploner A, Miller LD, Hall P, Bergh J, Pawitan Y (2005) Correlation test to assess low-level processing of high-density oligonucletide microarray data. BMC Bioinformatics 6(80):1–20
Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415(6870):436–442
Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 98(26):15149–15154
Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP (2006) GenePattern 2.0. Nat Genet 38(5):500–501. doi:10.1038/ng0506-500
Rifkin R, Klautau A (2004) In defense of One-Vs-All classification. J Mach Learn Res 5(1):101–141
Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de Rijn M, Waltham M, Pergamenschikov A, Lee JC, Lashkari D, Shalon D, Myers TG, Weinstein JN, Botstein D, Brown PO (2000) Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 24(3):227–235
Ross JS, Hatzis C, Symmans WF, Pusztai L, Hortobagyi GN (2008) Commercialized multigene predictors of clinical outcome for breast cancer. Oncologist 13(5):477–493. doi:13/5/477 [pii]10.1634/theoncologist.2007-0248
Royston P, Sauerbrei W (2004) A new measure of prognostic separation in survival data. Stat Med 23(5):723–748. doi:10.1002/sim.1621
Sarder P, Schierding W, Cobb JP, Nehorai A (2010) Estimating sparse gene regulatory networks using a bayesian linear regression. IEEE Trans Nanobioscience 9(2):121–131. doi:10.1109/TNB.2010.2043444
Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235):467–470
Schumacher M, Binder H, Gerds TA (2007) Assessment of survival prediction models based on microarray data. Bioinformatics 23(14):1768–1774
Sheng Q, Moreau Y, De Moor B (2003) Biclustering microarray data by Gibbs sampling. Bioinformatics 19(Suppl 2):ii196–ii205. doi:10.1093/bioinformatics/btg1078
Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Schrf U, Thierry-Mieg J, Wang C, Wilson M, Wolber PK, Zhang L, Amur S, Bao W, Barbacioru CC, Lucas AB, Bertholet V, Boysen C, Bromley B, Brown D, Brunner A, Canales R, Cao XM, Cebula TA, Chen JJ, Cheng J, Chu TM, Chudin E, Corson J, Corton JC, Croner LJ, Davies C, Davison TS, Delenstarr G, Deng X, Dorris D, Eklund AC, Fan XH, Fang H, Fulmer-Smentek S, Fuscoe JC, Gallagher K, Ge W, Guo L, Guo X, Hager J, Haje PK, Han J, Han T, Harbottle HC, Harris SC, Hatchwell E, Hauser CA, Hester S, Hong H, Hurban P, Jackson SA, Ji H, Knight CR, Kuo WP, LeClerc JE, Levy S, Li QZ, Liu C, Liu Y, Lombardi MJ, Ma Y, Magnuson SR, Maqsodi B, McDaniel T, Mei N, Myklebost O, Ning B, Novoradovskaya N, Orr MS, Osborn TW, Papallo A, Patterson TA, Perkins RG, Peters EH, Peterson R, Philips KL, Pine PS, Pusztai L, Qian F, Ren H, Rosen M, Rosenzweig BA, Samaha RR, Schena M, Schroth GP, Shchegrova S, Smith DD, Staedtler F, Su Z, Sun H, Szallasi Z, Tezak Z, Thierry-Mieg D, Thompson KL, Tikhonova I, Turpaz Y, Vallanat B, Van C, Walker SJ, Wang SJ, Wang Y, Wolfinger R, Wong A, Wu J, Xiao C, Xie Q, Xu J, Yang W, Zhang L, Zhong S, Zong Y, Slikker W Jr (2006) The MicroArray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24(9):1151–1161
Shi L, Campbell G, Jones WD, Campagne F, Wen Z, Walker SJ, Su Z, Chu TM, Goodsaid FM, Pusztai L, Shaughnessy JD Jr, Oberthuer A, Thomas RS, Paules RS, Fielden M, Barlogie B, Chen W, Du P, Fischer M, Furlanello C, Gallas BD, Ge X, Megherbi DB, Symmans WF, Wang MD, Zhang J, Bitter H, Brors B, Bushel PR, Bylesjo M, Chen M, Cheng J, Chou J, Davison TS, Delorenzi M, Deng Y, Devanarayan V, Dix DJ, Dopazo J, Dorff KC, Chou J, Davison TS, Delorenzi M, Deng Y, Devanarayan V, Dix DJ, Dopazo J, Dorff KC, Elloumi F, Fan J, Fan S, Fan X, Fang H, Gonzaludo N, Hess KR, Hong H, Huan J, Irizarry RA, Judson R, Juraeva D, Lababidi S, Lambert CG, Li L, Li Y, Li Z, Lin SM, Liu G, Lobenhofer EK, Luo J, Luo W, McCall MN, Nikolsky Y, Pennello GA, Perkins RG, Philip R, Popovici V, Price ND, Qian F, Scherer A, Shi T, Shi W, Sung J, Thierry-Mieg D, Thierry-Mieg J, Thodima V, Trygg J, Vishnuvajjala L, Wang SJ, Wu J, Wu Y, Xie Q, Yousef WA, Zhang L, Zhang X, Zhong S, Zhou Y, Zhu S, Arasappan D, Bao W, Lucas AB, Berthold F, Brennan RJ, Buness A, Catalano JG, Chang C, Chen R, Cheng Y, Cui J, Czika W, Demichelis F, Deng X, Dosymbekov D, Eils R, Feng Y, Fostel J, Fulmer-Smentek S, Fuscoe JC, Gatto L, Ge W, Goldstein DR, Guo L, Halbert DN, Han J, Harris SC, Hatzis C, Herman D, Huang J, Jensen RV, Jiang R, Johnson CD, Jurman G, Kahlert Y, Khuder SA, Kohl M, Li J, Li M, Li QZ, Li S, Liu J, Liu Y, Liu Z, Meng L, Madera M, Martinez-Murillo F, Medina I, Meehan J, Miclaus K, Moffitt RA, Montaner D, Mukherjee P, Mulligan GJ, Neville P, Nikolskaya T, Ning B, Page GP, Parker J, Parry RM, Peng X, Peterson RL, Phan JH, Quanz B, Ren Y, Riccadonna S, Roter AH, Samuelson FW, Schumacher MM, Shambaugh JD, Shi Q, Shippy R, Si S, Smalter A, Sotiriou C, Soukup M, Staedtler F, Steiner G, Stokes TH, Sun Q, Tan PY, Tang R, Tezak Z, Thorn B, Tsyganova M, Turpaz Y, Vega SC, Visintainer R, von Frese J, Wang C, Wang E, Wang J, Wang W, Westermann F, Willey JC, Woods M, Wu S, Xiao N, Xu J, Xu L, Yang L, Zeng X, Zhang M, Zhao 
C, Puri RK, Scherf U, Tong W, Wolfinger RD, Consortium M (2010) The MicroArray quality control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol 28(8):827–838. doi:nbt.1665 [pii]
Simon R (2003) Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data. Br J Cancer 89:1599–1604
Slodkowska EA, Ross JS (2009) MammaPrint 70-gene signature: another milestone in personalized medical care for breast cancer patients. Expert Rev Mol Diagn 9(5):417–422. doi:10.1586/erm.09.32
Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Eystein Lonning P, Borresen-Dale AL (2001) Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA 98(19):10869–10874
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geister S, Demeter J, Perou C, Lonning PE, Brown PO, Borresen-Dale A-L, Botstein D (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 1(14):8418–8423
Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M (2006) Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 98(4):262–272
Steel RGD, Torrie JH (1980) Principles and procedures of statistics. McGraw Hill, New York
Straver ME, Glas AM, Hannemann J, Wesseling J, van de Vijver MJ, Rutgers EJ, Vrancken Peeters MJ, van Tinteren H, van’t Veer LJ, Rodenhuis S (2010) The 70-gene signature as a response predictor for neoadjuvant chemotherapy in breast cancer. Breast Cancer Res Treat 119(3):551–558. doi:10.1007/s10549-009-0333-1
Sugar C (1998) Techniques for clustering and classification with applications to medical problems. Doctoral Thesis, Stanford University
Suzuki K (ed) (2011) Artificial neural networks—methodological advances and biomedical applications. Artifical Neural Network Intech, Croatia
Sweets JA (1988) Measuring the accuracy of diagnostic systems. Science 240(4857):1285–1293
Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96(6):2907–2912
Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, New York
The Cancer Letter (2011) Duke accepts Potti resignation; retraction process initiated with Nature Medicine. http://www.cancerletter.com/articles/20101123_1
Therneau TM, Gail M, Grambsch PM, Krickeberg K, Samet JM, Tsiatis A, Wong W (eds) (2000) Modeling survival data: extending the Cox model. Statistics for biology and health. Springer, New York. doi:10.1002/sim.956
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288
Tibshirani R (1997) The lasso method for variable selection in the Cox model. Stat Med 16(4):385–395. doi:10.1002/(SICI)1097-0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3
Tibshirani R, Walther G (2005) Cluster validation by prediction strength. J Comput Graph Stat 14(3):511–528
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B Stat Methodol 63(2):411–423. doi:10.1111/1467-9868.00293
Tibshirani R, Hastie T, Narasimhan B, Chu G (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 99(10):6567–6572. doi:10.1073/pnas.082099299
Tseng GC, Wong WH (2005) Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics 61(1):10–16. doi:10.1111/j.0006-341X.2005.031032.x
Ishwaran H, Kogalur U, Blackstone E, Lauer M (2008) Random survival forests. Ann Appl Stat 2(3):841–860
Uno H, Cai T, Pencina MJ, D’Agostino RB, Wei LJ (2011) On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med 30(10):1105–1117. doi:10.1002/sim.4154
Van Belle V, Pelckmans K, Van Huffel S, Suykens JA (2011a) Improved performance on high-dimensional survival data by application of survival-SVM. Bioinformatics 27(1):87–94. doi:10.1093/bioinformatics/btq617
Van Belle V, Pelckmans K, Van Huffel S, Suykens JA (2011b) Support vector methods for survival analysis: a comparison between ranking and regression approaches. Artif Intell Med 53(2):107–118. doi:10.1016/j.artmed.2011.06.006
van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R (2002) A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347(25):1999–2009
van der Laan MJ, Pollard KS, Bryan J (2003) A new partitioning around medoids algorithm. J Stat Comput Simulat 73(8):575–584
van Houwelingen H, Bruinsma T, Hart AA, van’t Veer LJ, Wessels LFA (2006) Cross-validated Cox regression on microarray gene expression data. Stat Med 25:3201–3216
van Rijsbergen C (1979) Information retrieval, 2nd edn. Butterworths, London
van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871):530–536
Verweij PJM, van Houwelingen JC (1993) Cross-validation in survival analysis. Stat Med 12:2305–2314
Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EMJJ, Atkins D, Foekens JA (2005) Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 365(9460):671–679. doi:10.1016/S0140-6736(05)17947-1
Webb A (2003) Statistical pattern recognition, 2nd edn. Wiley, New York
Wei JS, Greer BT, Westermann F, Steinberg SM, Son CG, Chen QR, Whiteford CC, Bilke S, Krasnoselsky AL, Cenacchi N, Catchpoole D, Berthold F, Schwab M, Khan J (2004) Prediction of clinical outcome using gene expression profiling and artificial neural networks for patients with neuroblastoma. Cancer Res 64(19):6883–6891. doi:10.1158/0008-5472.CAN-04-0695
Weiss SM, Kulikowski CA (1991) Computer systems that learn. Morgan Kaufmann, San Mateo
Welford SM, Gregg J, Chen E, Garrison D, Sorensen PH, Denny CT, Nelson SF (1998) Detection of differentially expressed genes in primary tumor tissues using representational differences analysis coupled to microarray hybridization. Nucleic Acids Res 26(12):3059–3065
Wilson CL, Miller CJ (2005) Simpleaffy: a BioConductor package for Affymetrix quality control and data analysis. Bioinformatics 21(18):3683–3685
Wu Z, Irizarry RA (2004) Preprocessing of oligonucleotide array data. Nat Biotechnol 22:656–658
Yeung KY, Bumgarner RE (2003) Multiclass classification of microarray data with repeated measurements: application to cancer. Genome Biol 4(12):R83. doi:10.1186/gb-2003-4-12-r83
Youden WJ (1950) Index for rating diagnostic tests. Cancer 3(1):32–35
Zhu J, Hastie T (2004) Classification of gene microarrays by penalized logistic regression. Biostatistics 5(3):427–443. doi:10.1093/biostatistics/5.3.427
Zilliox MJ, Irizarry RA (2007) A gene expression bar code for microarray data. Nat Methods 4(11):911–913. doi:10.1038/nmeth1102
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67(2):301–320. doi:10.1111/j.1467-9868.2005.00503.x
© 2012 Springer-Verlag Berlin Heidelberg
Cite this chapter
Haibe-Kains, B., Quackenbush, J. (2012). Analysis of Array Data and Clinical Validation of Array-Based Assays. In: Jordan, B. (ed) Microarrays in Diagnostics and Biomarker Development. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28203-4_11
DOI: https://doi.org/10.1007/978-3-642-28203-4_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28202-7
Online ISBN: 978-3-642-28203-4
eBook Packages: Biomedical and Life Sciences (R0)