Machine learning in low-level microarray analysis

Authors:
Benjamin I. P. Rubinstein, University of Melbourne, Australia
Jon McAuliffe, University of California at Berkeley, CA
Simon Cawley, Data Analysis Group, Affymetrix, Inc., Santa Clara, CA
Marimuthu Palaniswami, University of Melbourne, Australia
Kotagiri Ramamohanarao, University of Melbourne, Australia
Terence P. Speed, The Walter & Eliza Hall Institute of Medical Research, Australia

ACM SIGKDD Explorations Newsletter, Volume 5, Issue 2, December 2003, pp. 130–139. https://doi.org/10.1145/980972.980988

Published: 1 December 2003

Abstract

Machine learning and data mining have found many successful applications in microarray analysis, with gene clustering and classification of tissue samples being widely cited examples. Low-level microarray analysis, often associated with the pre-processing stage of the microarray life cycle, has increasingly become an area of active research, traditionally involving techniques from classical statistics. This paper explores opportunities for applying machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping, and resequencing. Relevant methods and ideas from the machine learning community include semi-supervised learning, learning from heterogeneous data, and incremental learning.

Index Terms

Machine learning in low-level microarray analysis
1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification

Index terms have been assigned to the content through auto-classification.

Published in

ACM SIGKDD Explorations Newsletter Volume 5, Issue 2
December 2003
202 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/980972
Copyright © 2003 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 December 2003
Author Tags
gene expression estimation
genotyping
incremental learning
learning from heterogeneous data
low-level microarray analysis
re-sequencing
semi-supervised learning
transcript discovery
transductive learning
Qualifiers
- article

