Abstract
Machine learning and data mining have found a multitude of successful applications in microarray analysis, with gene clustering and classification of tissue samples being widely cited examples. Low-level microarray analysis -- often associated with the pre-processing stage within the microarray life-cycle -- has increasingly become an area of active research, traditionally involving techniques from classical statistics. This paper explores opportunities for the application of machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping and resequencing. Relevant methods and ideas from the machine learning community include semi-supervised learning, learning from heterogeneous data, and incremental learning.
Index Terms
- Machine learning in low-level microarray analysis