A mixture model approach for the analysis of microarray gene expression data

https://doi.org/10.1016/S0167-9473(01)00046-9Get rights and content

Abstract

Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. We have developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes. We illustrate the use of these techniques with a dataset involving calorically restricted mice.

Introduction

A relatively recent addition to the armamentarium of investigators studying the genetics, molecular biology, and physiology of living organisms is the use of microarrays. This technique enables simultaneous and rapid assessment of the expression of literally thousands of genes or ESTs1 in different tissues, different groups of experimental animals, humans, or other organisms, or organisms measured under different circumstances (Kahn et al., 1999). After the gene expression is measured, one may try to identify those genes for which there is differential expression across groups, and then interpret the results.

In brief, there are two broad classes of microarrays for gene expression measurement. One class, cDNA arrays, apply spots of cDNAs to glass slides. One can than estimate mRNA by examining hybridization to the cDNA spots. These arrays, though somewhat easier to develop, may not be as specific in their measurement properties as the second class of arrays, namely oligonucleotide arrays. Oligonucleotide arrays place many thousands of gene-specific oligonucleotides in silico and allow one to examine mRNA binding to the oligonucleotides after correcting for estimates of background binding (‘noise’). For more details, see Weindruch et al. (2001) and references therein.

An example of this approach can be found in Lee et al. (1999) who studied differences in gene expression in 6347 genes in three groups of mice: (a) old mice; (b) young mice; and (c) old mice that had their caloric intake restricted since weaning. Each group consisted of three mice. Using a criterion of a two-fold increase in gene expression in the group exhibiting higher expression, Lee et al. (1999) identified 113 genes that appeared to exhibit differential expression between groups a and b. The two-fold criteria is admittedly somewhat arbitrary and Lee et al. (1999) did not provide any formal statistical information regarding significance levels, confidence intervals around effect size estimates, or related statistics. This is not surprising since detailed discussions of how to approach such data from a statistical point of view are notably absent from the literature and the approach used by Lee et al. (1999) appears to be state of the art (see White et al. (1999) for a similar example taking a similar statistical approach).

Investigators, of course, may wish to answer the question “Is the difference in expression for the such-and-such gene statistically significant?” However, there are number of other questions that are at least equally important and interesting including: (Q1) is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the groups?; (Q2) what is the best estimate of the number of genes for which there is a true difference in gene expression?; (Q3) what is the confidence interval around that estimate?; (Q4) if we set some threshold above which we declare results for a particular gene ‘interesting’ and worthy of follow-up study, what proportion of those genes are likely to be genes for which there is a real difference in expression and what proportion are likely to be false leads?; (Q5) what proportion of those genes not declared ‘interesting’ are likely to be genes for which there is a real difference in expression (i.e., misses or false negatives)?

A key challenge to the development of statistical methods for microarray data is the fact that the sample size (e.g., the number of mice) is often small but the number of measurements per item (the number of genes) is very large. The expression levels for an individual gene or EST may not be independent. If they were, statistical models could be developed that model gene expression levels as independent measurements. The absence of such models limits one's ability to answer many important questions regarding the distribution of differential gene expression levels across two or more groups.

The purpose of this paper is to present methods for addressing questions above. We begin with only two very general assumptions and another specific assumption that we later relax. The two general assumptions are: (1) For each gene (EST), the measurements of gene expression have a finite population mean and variance; (2) For each gene under study, there is a measure of expression available for each case and this measure has sufficient reliability and validity to be useful (we use the word ‘case’ generically to refer to any organism or tissue on which expression is measured). The extent to which any particular measure of gene expression meets the second criterion is an important question, but beyond the scope of this paper. The more specific assumption that we later relax is that differences in gene expression levels across two groups are independent. This allows the development of a statistical model that will facilitate answers to many of the above questions (and perhaps additional questions). After presenting the methodology, we then illustrate it using a data example. A simulation study evaluates the performance of the methods, assesses the effect that “dependence” has on the interpretation of results from the model, and provides a mechanism to allow for non-independence. Finally, we provide a discussion and description of future work.

Section snippets

Methods

Suppose that N=2n cases are randomly divided into two groups of size n.2

An illustrative example

To illustrate the methods that we are proposing, we analyze data described by Lee et al. (2000). Two groups of mice were considered: each group contains three mice.3 A distribution of 6347 t-distribution based p-values was obtained. Each p-value was obtained from the test, for a specified gene, H0: there is no difference in

Simulation study

To remain consistent with the previous example we use the term “mouse” to refer to a case or an experimental unit. We generated gene expression levels for 2n mice (n mice per experimental group, where n=5,10,20, and 40), and k=3000 genes. The data for the 2n mice are multivariate normal and generated independently from a 3000 dimensional normal distribution. That is, measurements for a mouse are generated fromX∼N3000(μ̄,Σ),where μ was a constant vector of length 3000 and equal to 10 and Σ=σ2BI6

Issues, future work, and general discussion

In the present paper, we have developed a set of procedures for analyzing microarray gene expression data that are intended to not only take into account but indeed to capitalize on the fact that many thousands of genes may be studied. This set of procedures should lead to the ability to answer important questions about group or condition differences in gene expression, can be adapted to allow for non-normality and heteroscedasticity, and may be used with small sample sizes. Nevertheless, it

Acknowledgements

This research was supported in part by NIH grants R29DK47256, R01DK51716, P30DK26687, P01AG11915, R01ES09912 and 5 U24 DK58776. We are grateful to Drs. Nicholas Schork, Joel Horowitz, and Michael C. Neale for their helpful comments.

References (32)

  • Efron, B., 1982. The jackknife, the bootstrap and other resampling plans. Society for Industrial and Applied...
  • M. Evans et al.

    Statistical Distributions, A Wiley-Interscience Publication.

    (1993)
  • P.I. Good

    Resampling Methods. A Practical Guide to Data Analysis

    (1999)
  • J.A. Hanley

    Receiver operating characteristic (ROC) methodology: the state of the art

    Crit. Rev. Diagnostic Imaging

    (1989)
  • K.W. Hasel et al.

    Nucleotide sequence of a cDNA coding for mouse cyclophilin

    Nucleic Acids Res.

    (1990)
  • Horowitz, J.L. The bootstrap. In: Handbook of Econometrics, Vol. 5. Elsevier, New York, in...
  • Cited by (0)

    View full text