A mixture model approach for the analysis of microarray gene expression data
Introduction
A relatively recent addition to the armamentarium of investigators studying the genetics, molecular biology, and physiology of living organisms is the use of microarrays. This technique enables simultaneous and rapid assessment of the expression of literally thousands of genes or ESTs1 in different tissues, different groups of experimental animals, humans, or other organisms, or organisms measured under different circumstances (Kahn et al., 1999). After the gene expression is measured, one may try to identify those genes for which there is differential expression across groups, and then interpret the results.
In brief, there are two broad classes of microarrays for gene expression measurement. One class, cDNA arrays, apply spots of cDNAs to glass slides. One can than estimate mRNA by examining hybridization to the cDNA spots. These arrays, though somewhat easier to develop, may not be as specific in their measurement properties as the second class of arrays, namely oligonucleotide arrays. Oligonucleotide arrays place many thousands of gene-specific oligonucleotides in silico and allow one to examine mRNA binding to the oligonucleotides after correcting for estimates of background binding (‘noise’). For more details, see Weindruch et al. (2001) and references therein.
An example of this approach can be found in Lee et al. (1999) who studied differences in gene expression in 6347 genes in three groups of mice: (a) old mice; (b) young mice; and (c) old mice that had their caloric intake restricted since weaning. Each group consisted of three mice. Using a criterion of a two-fold increase in gene expression in the group exhibiting higher expression, Lee et al. (1999) identified 113 genes that appeared to exhibit differential expression between groups a and b. The two-fold criteria is admittedly somewhat arbitrary and Lee et al. (1999) did not provide any formal statistical information regarding significance levels, confidence intervals around effect size estimates, or related statistics. This is not surprising since detailed discussions of how to approach such data from a statistical point of view are notably absent from the literature and the approach used by Lee et al. (1999) appears to be state of the art (see White et al. (1999) for a similar example taking a similar statistical approach).
Investigators, of course, may wish to answer the question “Is the difference in expression for the such-and-such gene statistically significant?” However, there are number of other questions that are at least equally important and interesting including: (Q1) is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the groups?; (Q2) what is the best estimate of the number of genes for which there is a true difference in gene expression?; (Q3) what is the confidence interval around that estimate?; (Q4) if we set some threshold above which we declare results for a particular gene ‘interesting’ and worthy of follow-up study, what proportion of those genes are likely to be genes for which there is a real difference in expression and what proportion are likely to be false leads?; (Q5) what proportion of those genes not declared ‘interesting’ are likely to be genes for which there is a real difference in expression (i.e., misses or false negatives)?
A key challenge to the development of statistical methods for microarray data is the fact that the sample size (e.g., the number of mice) is often small but the number of measurements per item (the number of genes) is very large. The expression levels for an individual gene or EST may not be independent. If they were, statistical models could be developed that model gene expression levels as independent measurements. The absence of such models limits one's ability to answer many important questions regarding the distribution of differential gene expression levels across two or more groups.
The purpose of this paper is to present methods for addressing questions above. We begin with only two very general assumptions and another specific assumption that we later relax. The two general assumptions are: (1) For each gene (EST), the measurements of gene expression have a finite population mean and variance; (2) For each gene under study, there is a measure of expression available for each case and this measure has sufficient reliability and validity to be useful (we use the word ‘case’ generically to refer to any organism or tissue on which expression is measured). The extent to which any particular measure of gene expression meets the second criterion is an important question, but beyond the scope of this paper. The more specific assumption that we later relax is that differences in gene expression levels across two groups are independent. This allows the development of a statistical model that will facilitate answers to many of the above questions (and perhaps additional questions). After presenting the methodology, we then illustrate it using a data example. A simulation study evaluates the performance of the methods, assesses the effect that “dependence” has on the interpretation of results from the model, and provides a mechanism to allow for non-independence. Finally, we provide a discussion and description of future work.
Section snippets
Methods
Suppose that N=2n cases are randomly divided into two groups of size n.2
An illustrative example
To illustrate the methods that we are proposing, we analyze data described by Lee et al. (2000). Two groups of mice were considered: each group contains three mice.3 A distribution of 6347 t-distribution based p-values was obtained. Each p-value was obtained from the test, for a specified gene, H0: there is no difference in
Simulation study
To remain consistent with the previous example we use the term “mouse” to refer to a case or an experimental unit. We generated gene expression levels for 2n mice (n mice per experimental group, where , and 40), and k=3000 genes. The data for the 2n mice are multivariate normal and generated independently from a 3000 dimensional normal distribution. That is, measurements for a mouse are generated fromwhere μ was a constant vector of length 3000 and equal to 10 and
Issues, future work, and general discussion
In the present paper, we have developed a set of procedures for analyzing microarray gene expression data that are intended to not only take into account but indeed to capitalize on the fact that many thousands of genes may be studied. This set of procedures should lead to the ability to answer important questions about group or condition differences in gene expression, can be adapted to allow for non-normality and heteroscedasticity, and may be used with small sample sizes. Nevertheless, it
Acknowledgements
This research was supported in part by NIH grants R29DK47256, R01DK51716, P30DK26687, P01AG11915, R01ES09912 and 5 U24 DK58776. We are grateful to Drs. Nicholas Schork, Joel Horowitz, and Michael C. Neale for their helpful comments.
References (32)
- et al.
Genomic-scale gene expression analysis of lymphocyte growth, tolerance and malignancy
Curr. Opinion Immunol.
(2000) - et al.
On improving standard estimators via linear empirical Bayes methods
Statist. Probab. Lett.
(1999) - et al.
Microarray profiling of gene expression in aging and its alteration by caloric restriction in mice
J Nutr.
(2001) QTL analysis: power, precision, and accuracy.
Bootstrap Methods. Wiley Series in Probability and Statistics.
(1999)Practical Nonparametric Statistics, Wiley Series in Probability and Statistics
(1999)- et al.
Characterization of the mouse cDNA and gene coding for a hepatocyte growth factor-like protein: expression during development
Biochemistry
(1991) - et al.
Genomic control for association studies
Biometrics
(1999) A note on information seldom reported via the p value
Amer. Statist.
(1999)- et al.
False discoveries in genome scanning
Genet. Epidemiol.
(1997)