doi:10.1016/S0167-9473(03)00069-0
Copyright © 2003 Elsevier B.V. All rights reserved.
Transformations, background estimation, and process effects in the statistical analysis of microarrays
a Department of Mathematics, University of Colorado-Denver, P.O. Box 173364, CB170, Denver, CO 80217-3364, USA
b Department of Pharmacology, University of Colorado, Health Sciences Center, Denver, CO 80262, USA
Received 1 August 2002; accepted 1 March 2003; available online 15 April 2003.
Abstract
Microarray technology has made available large data sets that can provide information on gene expression when cells are subjected to various treatments. Before proceeding with a formal statistical analysis, many biological and procedural aspects should be considered. These aspects may guide the analysis and subsequent statistical inference. Several of these issues are discussed in connection with the analysis of oligonucleotide and cDNA microarray experiments. The particular focus in this article is on effects caused by the cDNA slide manufacturing process, on appropriate transformations of the data, and on adjustments for background. A prescription for the analysis of microarray data is proposed and demonstrated using data from a cDNA experiment comparing gene expression in two mouse cell lines; a candidate set of genes is identified for further study. The prescription may be modified for oligonucleotide microarray data.
Author Keywords: Background variation; Fluorescence; Lognormal distribution; Median polish; Process variation; Smoothing; Tukey's g-family of distributions
Fig. 1. Mean–variance plots for one set of 529 foreground counts in the red channel of Block 8 (see Section 5 for a description of the data). The 529 values were sorted and binned into 23 categories of 23 values each; the left panel plots the sample variance versus the sample mean for all 23 categories. The right panel shows the points from only those categories whose sample means lie between 550 and 800. The fitted quadratic is: variance = 15 + 0.005358 (mean − 648)².
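The binning and quadratic fit described in the Fig. 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, ordinary least squares is an assumption (the caption does not say how the quadratic was fitted), and the means are centered before fitting only for numerical stability.

```python
import numpy as np

def binned_mean_variance(values, n_bins=23):
    """Sort the values, split them into n_bins equal-size groups,
    and return the per-group sample mean and sample variance."""
    x = np.sort(np.asarray(values, dtype=float))
    if x.size % n_bins != 0:
        raise ValueError("number of values must be a multiple of n_bins")
    groups = x.reshape(n_bins, -1)
    return groups.mean(axis=1), groups.var(axis=1, ddof=1)

def fit_quadratic(means, variances):
    """Least-squares fit of variance = alpha + beta * (mean - gamma)**2.

    Assumption: an ordinary quadratic fit, re-expressed in vertex form.
    """
    means = np.asarray(means, dtype=float)
    shift = means.mean()                       # center for conditioning
    c2, c1, c0 = np.polyfit(means - shift, variances, 2)
    beta = c2
    gamma = shift - c1 / (2.0 * c2)            # location of the minimum
    alpha = c0 - c1 ** 2 / (4.0 * c2)          # variance at the minimum
    return alpha, beta, gamma
```

For the 529 counts in the caption, `binned_mean_variance(counts)` would return 23 means and 23 variances; passing only the pairs whose mean lies between 550 and 800 to `fit_quadratic` corresponds to the right panel.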
Fig. 2. Comparing two transformations for the data illustrated in Fig. 1. The solid line is the function f(x) given by Eq. (2), where α=15, β=0.005358, and γ=648. The dotted line is the function z(x) given by Eq. (4), where g=0.58, a=510, and b=218. The two transformations are matched to coincide at x=500.
Fig. 3. Quantile–quantile plots ([Wilk and Gnanadesikan 1968]) of transformed data values in Fig. 1 via f(x) [Eq. (2)], where α=15, β=0.005358, and γ=648, denoted as “Log+sqrt transformation”, and gy(x) [Eq. (4)], where g=0.58, a=510, b=218, denoted as “inverse-g transformation”. The top row shows the result applied to all 529 values; the bottom row shows the result in only those 486 values whose g-transformed value lies between −2.5 and 2.5. The inverse g transformation results in a distribution that appears roughly Gaussian with a long right tail, which may indicate genes with significantly high expression (see Section 5).
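Eq. (4) is not reproduced on this page, so the following sketch rests on an assumption: that the “inverse-g transformation” is the usual inverse of Tukey's g-family, in which x = a + b (exp(g z) − 1)/g for a standard Gaussian z. The function names and the plotting positions (i − 0.5)/n are illustrative choices, not taken from the article.

```python
import numpy as np
from statistics import NormalDist

def g_transform(z, g=0.58, a=510.0, b=218.0):
    """Tukey g-family: map a Gaussian score z to the data scale."""
    z = np.asarray(z, dtype=float)
    return a + b * np.expm1(g * z) / g

def inverse_g(x, g=0.58, a=510.0, b=218.0):
    """Assumed inverse-g transformation: map data x back to a Gaussian score.

    Only defined for x > a - b/g; smaller values yield nan or -inf.
    """
    x = np.asarray(x, dtype=float)
    return np.log1p(g * (x - a) / b) / g

def normal_qq(values):
    """Return (theoretical Gaussian quantiles, sorted values) for a Q-Q plot."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    probs = (np.arange(1, n + 1) - 0.5) / n    # plotting positions
    std_normal = NormalDist()
    theo = np.array([std_normal.inv_cdf(p) for p in probs])
    return theo, v
```

Plotting the second output of `normal_qq(inverse_g(counts))` against the first gives a quantile–quantile plot of the kind shown in Fig. 3; points that follow a straight line suggest a roughly Gaussian shape.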
Fig. 4. Results of applying median polish to the background counts in the red and green channels. The top (respectively, bottom) plots show the effect of the row (respectively, column) number (red on left and green on right); i.e., the estimated number of laser scanning units of light intensity above or below the overall term m8r=223 (respectively, m8g=337). Limits of 1 standard error based on 200 bootstrap replications of the residuals are shown.
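Median polish fits an additive model (overall term + row effect + column effect + residual) by alternately sweeping out row and column medians. The sketch below is a generic implementation under stated assumptions: the function name, iteration limit, and tolerance are illustrative, and the article's bootstrap standard errors are not reproduced here.

```python
import numpy as np

def median_polish(table, max_iter=10, tol=1e-6):
    """Fit table[i, j] = overall + row[i] + col[j] + resid[i, j]
    by alternately removing row medians and column medians."""
    resid = np.array(table, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        r_med = np.median(resid, axis=1)
        resid -= r_med[:, None]
        row += r_med
        # Move the median of the column effects into the overall term.
        delta = np.median(col)
        col -= delta
        overall += delta
        # Sweep column medians out of the residuals into the column effects.
        c_med = np.median(resid, axis=0)
        resid -= c_med[None, :]
        col += c_med
        # Move the median of the row effects into the overall term.
        delta = np.median(row)
        row -= delta
        overall += delta
        if max(np.abs(r_med).max(), np.abs(c_med).max()) < tol:
            break
    return overall, row, col, resid
```

Applied to the grid of background counts for one block and one channel, the returned `overall` plays the role of the overall term in the caption, and `row` and `col` are the row and column effects plotted in Fig. 4.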
Fig. 5. Layer and stripe effects resulting from a median polish of the overall terms, mkr and mkg, where k=1,2,3,4 refers to the blocks in the first layer, and k=29,30,31,32 refers to the blocks in the eighth layer, and k=4,8,12,…,32 refers to the fourth (rightmost) stripe on the slide. Limits of 1 standard error based on 200 bootstrap replications of the residuals are shown.
Fig. 6. Coded residuals from the median polish fit to the 529 red background values in Block 8. Radii of circles are proportional to magnitudes of negative residuals; sides of squares are proportional to magnitudes of positive residuals. Structure in residuals indicates need for an additional term in the additive fit.
Fig. 7. Coded residuals from the median polish fit plus one degree of freedom for non-additivity [Eq. (7)] to the 529 red background values in Block 8. Radii of circles are proportional to magnitudes of negative residuals; sides of squares are proportional to magnitudes of positive residuals. Residuals range in size from −23 to +35; lack of structure in the plot indicates success of the “plus-one fit.”
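Eq. (7) is not reproduced on this page. A common form of the “one degree of freedom for non-additivity” term, due to Tukey, regresses the additive-fit residuals on the comparison values row[i] × col[j] / overall. The sketch below assumes that form and a least-squares slope through the origin; the function name is hypothetical and the article may estimate the slope differently (for example, with a resistant method).

```python
import numpy as np

def plus_one_fit(overall, row, col, resid):
    """Add one multiplicative term k * row[i] * col[j] / overall
    to an additive fit and return (k, updated residuals).

    Assumption: k is the least-squares slope of the residuals
    on the comparison values, with no intercept.
    """
    row = np.asarray(row, dtype=float)
    col = np.asarray(col, dtype=float)
    resid = np.asarray(resid, dtype=float)
    comparison = np.outer(row, col) / overall
    denom = np.sum(comparison * comparison)
    if denom == 0.0:
        # No row-by-column structure to fit; leave residuals unchanged.
        return 0.0, resid.copy()
    k = np.sum(comparison * resid) / denom
    return k, resid - k * comparison
```

The inputs are the overall term, row effects, column effects, and residuals from an additive fit such as median polish; if the updated residuals show no spatial pattern, the extra term has absorbed the structure.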
Fig. 8. Coded plot of ZR and ZG in blocks 7 and 8, obtained via transforming the foreground values for fitted background in the red and green channels. Radii of circles are proportional to magnitudes of negative Z-values; sides of squares are proportional to magnitudes of positive Z-values. Lack of structure indicates success of the model for background in removing spatial structure.
Fig. 9. Left panel: Plot of ZG versus ZR (labeled on the plot as Zg and Zr), the transformed and background-adjusted spot intensities in block 8. Estimated correlation coefficient is 0.97 (Pearson) and 0.94 (robust); lines from which they are estimated are shown (but barely distinguishable) on the plot. Right panel: Plot of ZR−ZG versus ZR+ZG, to remove visual effect of extreme correlation. Limits of 3 estimated standard deviations based on the two correlation estimates are shown (solid=Pearson; DASHED=robust).
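One way to obtain limits for ZR−ZG from a correlation estimate: if ZR and ZG each have unit variance and correlation ρ, then Var(ZR−ZG) = 2(1−ρ), and ZR−ZG is uncorrelated with ZR+ZG. The sketch below uses that relationship; the unit-variance assumption and the function names are illustrative, and the article may compute its limits differently.

```python
import numpy as np

def diff_limit(rho, n_sd=3.0):
    """Half-width of the band for ZR - ZG, assuming unit-variance Z's
    with correlation rho, so that Var(ZR - ZG) = 2 * (1 - rho)."""
    return n_sd * np.sqrt(2.0 * (1.0 - rho))

def flag_candidates(zr, zg, rho=None, n_sd=3.0):
    """Return (ZR - ZG, ZR + ZG, mask of spots outside the band).

    If rho is None, the Pearson correlation of the inputs is used.
    """
    zr = np.asarray(zr, dtype=float)
    zg = np.asarray(zg, dtype=float)
    if rho is None:
        rho = np.corrcoef(zr, zg)[0, 1]
    diff = zr - zg
    total = zr + zg
    return diff, total, np.abs(diff) > diff_limit(rho, n_sd)
```

Under these assumptions, the two correlation estimates in the caption give limits of about ±0.73 (ρ = 0.97) and ±1.04 (ρ = 0.94); spots flagged by the mask are candidates for differential expression between the two channels.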
Fig. 10. Plots of ZR−ZG versus ZR+ZG in Blocks 1–16. Limits of 3 estimated standard deviations based on the two correlation estimates are shown (dashed=Pearson; SOLID=robust).
Fig. 11. Plots of ZR−ZG versus ZR+ZG in Blocks 17–32. Limits of 3 estimated standard deviations based on the two correlation estimates are shown (solid=Pearson; DASHED=robust).
Table 1.
