Machine learning in low-level microarray analysis

Published: 01 December 2003
Abstract

Machine learning and data mining have found a multitude of successful applications in microarray analysis, with gene clustering and classification of tissue samples being widely cited examples. Low-level microarray analysis -- often associated with the pre-processing stage within the microarray life-cycle -- has increasingly become an area of active research, traditionally involving techniques from classical statistics. This paper explores opportunities for the application of machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping and resequencing. Relevant methods and ideas from the machine learning community include semi-supervised learning, learning from heterogeneous data, and incremental learning.
