Metagene projection for cross-platform, cross-species characterization of global transcriptional states
- Pablo Tamayo*,
- Daniel Scanfeld*,
- Benjamin L. Ebert*,
- Michael A. Gillette*,†,
- Charles W. M. Roberts‡, and
- Jill P. Mesirov*,§
- *Eli and Edythe L. Broad Institute, Massachusetts Institute of Technology and Harvard University, Cambridge, MA 02141;
- †Pulmonary and Critical Care Medicine, Massachusetts General Hospital, 55 Fruit Street, Boston, MA 02114; and
- ‡Department of Pediatric Oncology, Dana–Farber Cancer Institute, Boston, MA 02115
Communicated by Edward M. Scolnick, The Broad Institute, Cambridge, MA, February 6, 2007 (received for review December 7, 2006)
Abstract
The high dimensionality of global transcription profiles, the expression level of 20,000 genes in a much smaller number of samples, presents challenges that affect the sensitivity and general applicability of analysis results. In principle, it would be better to describe the data in terms of a small number of metagenes, positive linear combinations of genes, which could reduce noise while still capturing the invariant biological features of the data. Here, we describe how to accomplish such a reduction in dimension by a metagene projection methodology, which can greatly reduce the number of features used to characterize microarray data. We show, in applications to the analysis of leukemia and lung cancer data sets, how this approach can help assess and interpret similarities and differences between independent data sets, enable cross-platform and cross-species analysis, improve clustering and class prediction, and provide a computational means to detect and remove sample contamination.
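As a rough illustration of the idea of metagenes as positive linear combinations of genes, the sketch below factors a small nonnegative genes-by-samples matrix into metagenes and per-sample coefficients, then expresses a new sample in terms of those fixed metagenes. It assumes NMF (listed among the abbreviations) with standard multiplicative updates as the factorization. The abstract does not specify the algorithm or the projection step, so the function names `nmf`, `project`, and `dominant`, the toy data, and the nonnegative projection used here are all hypothetical and are not the authors' implementation.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def nmf(A, k, iters=500, seed=0, eps=1e-9):
    """Approximate nonnegative A (genes x samples) as W (genes x k) times
    H (k x samples) using multiplicative updates for squared error."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

def project(W, B, iters=500, eps=1e-9):
    """Express new samples B (genes x samples) in terms of fixed metagenes W
    by updating only the coefficients, which stay nonnegative.
    (An assumed projection step, chosen for simplicity.)"""
    k, m = len(W[0]), len(B[0])
    H = [[1.0] * m for _ in range(k)]
    Wt = transpose(W)
    WtB, WtW = matmul(Wt, B), matmul(Wt, W)
    for _ in range(iters):
        den = matmul(WtW, H)
        H = [[H[i][j] * WtB[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
    return H

def dominant(H, j):
    """Index of the metagene with the largest coefficient in sample j."""
    return max(range(len(H)), key=lambda i: H[i][j])

# Toy data: 6 genes x 4 samples; samples 0-1 and samples 2-3 use disjoint genes.
A = [
    [2, 4, 0, 0],
    [1, 2, 0, 0],
    [3, 6, 0, 0],
    [0, 0, 4, 2],
    [0, 0, 6, 3],
    [0, 0, 2, 1],
]
W, H = nmf(A, k=2)

# A new sample whose active genes match the first group.
B = [[2], [1], [3], [0], [0], [0]]
H_new = project(W, B)

print("dominant metagene per training sample:", [dominant(H, j) for j in range(4)])
print("dominant metagene for new sample:", dominant(H_new, 0))
```

Plain Python lists keep the sketch dependency-free; with realistic data (thousands of genes) an array library would be the practical choice. The point is only that each sample ends up described by `k` nonnegative coefficients instead of one value per gene.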
Footnotes
- §To whom correspondence should be addressed. E-mail: mesirov{at}broad.mit.edu
Author contributions: P.T. and J.P.M. designed research; P.T., D.S., B.L.E., C.W.M.R., and J.P.M. performed research; M.A.G. contributed data; P.T., D.S., and J.P.M. analyzed data; and P.T., B.L.E., and J.P.M. wrote the paper.
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0701068104/DC1.
- Abbreviations: GSEA, gene set enrichment analysis; NMF, nonnegative matrix factorization; SVM, support vector machine.
Freely available online through the PNAS open access option.
- © 2007 by The National Academy of Sciences of the USA





