Simulation of cDNA microarrays via a parameterized random signal model

Yoganand Balagurunathan; Edward R. Dougherty; Yidong Chen; Michael L. Bittner; Jeffrey M. Trent

doi:10.1117/1.1486246

1 July 2002

Journal of Biomedical Optics, Vol. 7, Issue 3, (July 2002). https://doi.org/10.1117/1.1486246

1. Introduction

Since the inception of cDNA microarray technology¹ as a high throughput method to gain information about gene functions and characteristics of biological samples, many applications of the technology have been reported.² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ¹⁰ With the improvement of the technology, including fabrication, fluorescent labeling, hybridization, and detection, many computer software packages for extracting signals arising from tagged mRNA hybridized to arrayed cDNA locations have been designed and applied in various experiments.¹¹ ¹² ¹³ As reported in Ref. 11, a target detection procedure has been implemented that utilizes manually specified target arrays, extracts the background via the image histogram, predicts target shape and then evaluates the intensities from each cDNA location and its corresponding ratio quantity.

While most software packages are satisfactory for routine image analysis and the extraction of information regarding phenomena with highly expressed genes, the desire to discover subtle effects via microarray experiments will ultimately drive experiments towards the limit of the technology,¹³ with less starting mRNA and/or more weakly expressed genes. Weak signals and their interaction with background fluorescent noise are most problematic. Problems include the nonlinear trend in expression scatter plots, fishtailing at lower signal range, low measurement quality of expression levels due to uneven local background, and small cDNA-deposition areas. These artifacts, or sources of uncertainty, creep into higher-level statistical data analyses, such as clustering and classification, raising concerns about their validity. Numerous remedies have been proposed, such as carefully designed experiments in which duplications are used to minimize the uncertainty.¹⁴ ¹⁵ However, given the scarcity of certain biological samples, large duplications of experiments are often impractical. To improve detection and quantification of weak targets, it is important to understand the entire process of microarray formation, from fabrication to the scanning microscope. Use of the knowledge that the average intensity of the background fluorescence is normally distributed to help design a background detection algorithm is one example of incorporating prior knowledge into detection methods.¹⁶

A complex electrical-optical-chemical process is involved in cDNA-microarray technology, from fabrication of the cDNA slide, to preparing the RNA, to hybridization, to the capture of images created from excitation of the attached fluors. This complex process possesses multiple random factors. Images arising from it must be processed digitally to obtain the gene expression intensities and/or ratios that quantify relative expression levels.¹¹ The efficacy of the analysis to be carried out on the ratios, be it clustering,³ ¹⁷ ¹⁸ ¹⁹ classification,⁵ ¹⁰ prediction,²⁰ ²¹ or some other, depends on the ability of the imaging algorithm to extract sufficiently accurate and consistent intensity levels from the spots. As is common in imaging applications, it is difficult (or perhaps impossible) to utilize physical ground truth as a standard by which to evaluate algorithm performance. Hence, it is common to proceed by modeling the imaging process to simulate the various aspects of the real image process.²² ²³ ²⁴ Image processing algorithms can be applied to the simulated process to evaluate their performance. One might also concurrently adjust the model parameters to see how changing various random components of the formation process impacts upon the final images, and therefore the ability to extract meaningful information. For instance, an algorithm might have biases at low signal intensities or high noise intensities that are not present at higher signal intensities or lower noise intensities. Here it should be recognized that “ground truth” refers to the true signal intensity, not the actual quantity of mRNA in the sample corresponding to the DNA in the spot.

Modeling anything but a very simple physical process is a very challenging task. A physical process is typically influenced, directly or indirectly, by forces whose interrelation is unknown. The resulting model will be a random process. Each realization of the model depends on random variables chosen according to various model distributions. A good quantifiable model must approximate the physical process and have realistic variability to describe the randomness of the system. In the present work, microarray image formation is modeled by a series of random processes influenced by almost two dozen parameters. We will describe the modeling process in terms of the various random variables that determine spot size, shape, and intensity, as well as variables that affect the background, including noise. Each random variable is associated with a distribution. In some cases, one may select the parameters of the distribution (such as mean and variance for a normal distribution) to reflect the image qualities of interest, such as brightness, spot size, noise intensity, etc. In other cases, the distribution of a random variable is dependent on the outcome of some other variable, and it is possible that the parameters governing the distribution of a random variable may themselves be random variables.

Although we postulate various distributions to govern the variables in the model, one may wish to use other distributions to characterize the signal and noise distributions. Moreover, the experimenter is free to choose the parameters of the distributions. Microarray technology is evolving rapidly, and there are already many variations of the technology in use. Hence, model flexibility is mandatory. For instance, for a microarray system that does not produce doughnut holes in the spots, the variables associated with the hole can be nullified. In the case of a stable system in use without change for a sufficiently long period to produce a large number of images, one can apply statistical estimation to determine some model parameters, such as those for spot radius. Clearly, these estimates will only be of value to the specific system from which they have been derived. Hence, they remain outside the simulation package per se.

The simulation algorithm produces spots at a preset grid of locations that resemble the actual microarray. Each block corresponds to a specific pin of the robot hand, and the interblock variation is modeled in the simulation by allowing various model parameters to be randomized by block. At the start of each new block, the parameters of the spots are reset. The intention of the printing process is that spots possess regular circular shapes. Due to mechanical fatigue, the adhesion process for the DNA solution concentration, and biochemical interactions, various perturbations are possible in array preparation, printing, and scanning. Various features of the model simulate these random perturbations.

2. Simulation of cDNA Microarrays

The simulation of the cDNA microarray images is designed for two-color fluorescent systems with a scanning confocal microscope. A block diagram of the overall simulation process is given in Figure 1, which includes four main modules: fluorescent background simulation, simulation of cDNA target spot generation, postprocessing simulation and tagged image file format (TIFF) image output. Each simulation module contains many sequential steps (such as spot formation) or alternative steps (such as different background fluorescence). We will discuss each step according to the order in Figure 1 in the following subsections.

Figure 1

Figure shows the steps involved in generating the microarray.

2.1. Background Simulation

The fluorescent background level is an important part of expression-level estimation, since we routinely use the additive model to subtract the local background from the signal intensity measurement. It is understood that when the signal is sufficiently low, the interaction between the fluorescent background and signal affects the estimation process in most image analysis programs, resulting in lower measurement quality in the expression ratio. Many factors contribute to the observed fluorescent background: autofluorescence from the glass surface or the surface of the detection instrument, nonspecific binding of fluorescent residues after hybridization, local contamination from posthybridization slide handling, etc. A perfect system would yield a flat background possessing a normal distribution, while a microscope without an autofocus mechanism may produce a slanted background level if the slides are loaded unevenly. Other extreme hybridization conditions may cause higher nonspecific hybridization toward the edge of the hybridization chamber, which effectively creates a parabolic surface of background noise. We leave the local contamination to the postprocessing module in Sec. 2.3.

The background derived from surface fluorescence upon laser excitation is usually governed by the Poisson process, which can be approximated by a normal distribution when the arrival rate, or the accumulation of photons, is large enough.¹⁶ This property can be readily assessed by the histogram of any background region of the microarray images. Therefore, background noise is simulated by a normal distribution whose parameters are randomly chosen to describe the process: I_b∼N(μ_b,σ_b²). If multiple arrays are desired, the inter-array difference is modeled by a uniform distribution: μ_b∼U(a,b). σ_b is given as a multiple of μ_b: σ_b=k_bμ_b. Typically, k_b is about 0.1, that is, σ_b is about 10% of the mean background level.
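The flat-background model above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function name `sample_background` and its argument names are hypothetical, and the default k_b=0.1 follows the typical value quoted in the text.

```python
import random

def sample_background(width, height, a, b, k_b=0.1, rng=None):
    """Return (mu_b, pixels) for one array with a flat noisy background."""
    rng = rng or random.Random()
    mu_b = rng.uniform(a, b)   # inter-array mean, mu_b ~ U(a, b)
    sigma_b = k_b * mu_b       # sigma_b = k_b * mu_b
    # Each pixel is an independent draw I_b ~ N(mu_b, sigma_b^2).
    pixels = [[rng.gauss(mu_b, sigma_b) for _ in range(width)]
              for _ in range(height)]
    return mu_b, pixels
```

Passing a seeded `random.Random` instance as `rng` makes a simulated array reproducible, which is convenient when the same background is needed for algorithm comparisons.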

Rather than being constant across the entire microarray, the mean of the background noise may vary owing to various scanning effects. It can take different shapes: parabolic, positive slope, or negative slope. In this case a function g(x,y) is first generated (parabolic, positive slope, or negative slope) to form a background surface and normal noise is added to it pixel-wise. Thus, the background intensity is of the form I_b∼N(μ_b,σ_b²) with μ_b=γg(x,y), where γ∼U(a,b) is the targeted background noise level. Background deviation is set independently for each channel: σ_b₁=k_b₁μ_b and σ_b₂=k_b₂μ_b. Figure 2 shows various noise backgrounds with k_b₁=k_b₂=0.1. All images are shown in large size on a web page.²⁷
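A sketch of the shaped background follows. The text does not give the exact form of g(x,y), so the linear ramps and the centered paraboloid below, each scaled to the range [0.5, 1.5], are assumptions chosen only for illustration; `background_surface` and `noisy_background` are hypothetical names.

```python
import random

def background_surface(width, height, shape):
    """Unit-scale surface g(x, y); the [0.5, 1.5] value range is an assumption."""
    def g(x, y):
        u = x / (width - 1)                  # normalized column in [0, 1]
        if shape == "positive":
            return 0.5 + u                   # rises left to right
        if shape == "negative":
            return 1.5 - u                   # falls left to right
        if shape == "parabolic":
            v = y / (height - 1)             # normalized row in [0, 1]
            return 0.5 + 2.0 * ((u - 0.5) ** 2 + (v - 0.5) ** 2)
        raise ValueError("unknown shape: " + shape)
    return [[g(x, y) for x in range(width)] for y in range(height)]

def noisy_background(width, height, shape, a, b, k_b=0.1, rng=None):
    """Pixel-wise I_b ~ N(mu_b, (k_b * mu_b)^2) with mu_b = gamma * g(x, y)."""
    rng = rng or random.Random()
    gamma = rng.uniform(a, b)                # targeted background level
    surface = background_surface(width, height, shape)
    return [[rng.gauss(gamma * g, k_b * gamma * g) for g in row]
            for row in surface]
```

For a two-channel slide, the same surface would be reused with separate k_b₁ and k_b₂ values, one call per channel.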

Figure 2

Figure shows various background noises. The mean SNR is set at 1.0 for the slides. The slides have the following settings: (a) parabolic background noise, (b) positive slope background, and (c) negative slope background, all with the global noise parameter. The background deviation factor is set at k_b₁=k_b₂=10%.

In many practical examples, the nonspecific hybridization at the target location may be different from its peripheral region. Although one may have trouble pin-pointing this particular observation under normal conditions owing to signal interference, it is sometimes unmistakable when locations assumed to be weakly expressed, or not expressed at all, carry some nonzero readouts, or the intensity in the center is stronger than the doughnut ring if the printed target is doughnut shaped. We simulate this artifact under a gradient noise condition by allowing the background for the center holes to be at higher levels than the signal intensities. Hence, there is an option to use global background or local background information to set the noise parameter for the center hole. Figure 3 shows the effects of using local and global background parameters. This effect may not appear everywhere in a simulated image; however, it is often sufficient to require appropriate algorithm design in the image analysis program to lessen the penalty. The effects of weak targets will be further studied in later sections.

Figure 3

Example shows different noise settings for the spot's inner hole: (a) uses the global background parameter to fill the center hole, and (b) uses the local background to fill the center hole. The background noise is set to the sloped type with an SNR of 1.5.

2.2. Spot Simulation

cDNA deposition routinely follows a rigid grid defined by the robotic print pattern. The simulation algorithm produces spots at preset grid locations that resemble the actual microarray. In principle, print tips are manufactured uniformly; however, their microscopic morphologies, and thus their deposition-binding behaviors, are noticeably different. Each block corresponds to a specific print tip of the robot hand. To take tip variability into account, within each block the spot variation is governed by block parameters, which themselves are random variables. At the start of each new block, the spot parameters are reset according to these random variables.

The key simulation of this study is devoted to the cDNA targets, which nominally possess a circular shape. Owing to many factors, the actual shape may be highly noncircular. The model takes various random perturbations into account: (1) radius variation, (2) spot drifting locally, (3) center core variation, (4) chord removal, (5) edge noise, (6) edge enhancement, (7) signal intensity, and (8) signal response transform. Figure 4 shows a schematic drawing for the cDNA target simulation. The variables in the figure are explained in the following eight subsections.

Figure 4

cDNA microarray spot model.

2.2.1. Variation of Radius

Prior to distortion and noise, the cDNA deposition spot is considered to be circular with random radius S. The mean of the radius is set according to the array density and its variance relates to the consistency of spot size. S is modeled by a normal distribution having mean μ_s and variance σ_s², S∼N(μ_s,σ_s), with the standard deviation being a predetermined proportion, k_s, of the mean, or S∼N(μ_s,k_sμ_s). The radius mean is set for every block, and randomized over a small range within the array. The block randomness of μ_s is modeled by a uniform distribution, μ_s∼U(s_a,s_b). Figure 5 shows parts of blocks with spot radii depending on the number of spots in a block. For Figures 5(a)–5(c), the block portions are for block sizes (10,15), (20,25), and (25,45), respectively, where (col, row) denotes the number of spots in columns and rows within the block, respectively. Occasionally, a spot overlaps with its neighbors [Figure 5(c)] when k_s is set to a larger proportion. This situation simulates the condition where too much cDNA solution is deposited and/or the drying process may be slow in comparison to the liquid spreading process.
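The two-level radius model (block mean drawn uniformly, spot radii drawn normally around it) can be sketched as follows. The function name is hypothetical, and clamping negative draws to zero is an added assumption, since the text does not say how such draws are handled.

```python
import random

def sample_block_radii(n_spots, s_a, s_b, k_s, rng=None):
    """Return (mu_s, radii) for one block of n_spots spots."""
    rng = rng or random.Random()
    mu_s = rng.uniform(s_a, s_b)             # block mean, mu_s ~ U(s_a, s_b)
    # Spot radius S ~ N(mu_s, k_s * mu_s); the second argument of gauss
    # is a standard deviation. Negative draws are clamped (assumption).
    radii = [max(0.0, rng.gauss(mu_s, k_s * mu_s)) for _ in range(n_spots)]
    return mu_s, radii
```

Calling the function once per block reproduces the interblock variation described above, because each call redraws μ_s.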

Figure 5

Figure shows the variability in spot size and spread from its size. The spot radius distribution is automatically set depending on the number of spots in a block (width, height). The examples have (a) (10,15), μ_s∼U[23.3, 24.3]; (b) (20,25), μ_s∼U[12.6, 13.6]; and (c) (25,45), μ_s∼U[5.45, 6.45], with standard deviation k_s=1%, 7%, and 20% of the radius, respectively.

Depending on the robot arm and printing ability of the pins, the interspot distance, G_sp, may vary. Owing to the physical mechanics of the robot arm, the block size (pixel units) is fixed in most cases. The interspot distance can be set to accommodate spot size and random variation in spot radii. The effects are illustrated in Figure 6, where the number of rows and columns are fixed.

Figure 6

Figure shows interspot grid spacing, (a) G_sp=3 pixels, μ_s∼U[9.5 10.5], (b) G_sp=6 pixels, μ_s∼U[8 9], (c) G_sp=10 pixels, μ_s∼U[6.5 7.5]. The example has (35,20) rows, columns respectively with k_s=0.05.

2.2.2. Spot Drift

During the fabrication stage, the deposition of cDNA targets may not follow the predefined grid owing to print-tip rotation, vibration, or other mechanical causes. Other drifts are attributed to the slide’s coating properties and the drying rates of the cDNA. This displacement is modeled by possible random translations in the horizontal and vertical directions. Each spot has an equal probability, P_D, of drifting. If a spot is selected for drift, then the amounts of drift in both directions are random multiples of the current spot radius. The horizontal and vertical multiples, δ_x and δ_y, called the “drift levels,” are uniformly distributed: δ_x, δ_y∼U(d_a,d_b). The horizontal and vertical drifts are D_x=δ_xS and D_y=δ_yS, respectively. Interspot distance can be set according to the drift to minimize the impact of overlapping spots.
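A minimal sketch of the drift rule is given below; `sample_drift` is a hypothetical name. The text does not specify how the drift direction is chosen, so here the sign is simply whatever the range (d_a, d_b) allows, which is an assumption.

```python
import random

def sample_drift(radius, p_d, d_a, d_b, rng=None):
    """Return the drift (D_x, D_y) in pixels for one spot of the given radius."""
    rng = rng or random.Random()
    if rng.random() >= p_d:          # spot not selected for drift
        return 0.0, 0.0
    delta_x = rng.uniform(d_a, d_b)  # drift levels, as multiples of the radius
    delta_y = rng.uniform(d_a, d_b)
    return delta_x * radius, delta_y * radius   # D_x = delta_x * S, D_y = delta_y * S
```

The returned offsets would be added to the nominal grid center of the spot before its mask is drawn.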

Some microarray scanners capture two fluorescent signals in two passes of scanning. Due to the mechanical homing error, the two fluorescent channels may not align exactly. In these settings, some small offset between the two channels can be observed. This offset may occur at subpixel resolution. To simulate this offset, the model offers a random offset between the centers of the two channels. It is achieved by randomly offsetting the spot center of the second channel by one pixel in either of the horizontal and vertical directions. These offsets are applied following application of the spot drifts. Figure 7 illustrates the spot drift.

Figure 7

Figure shows the effect of spot drift with settings (P_D,d_a,d_b): (a) (0.05,5,100), (b) (0.25,15,100), (c) (0.5,50,100). As the activation probability and drift range are set higher, the spots drift farther from their grid centers.

It is essential for the image analysis algorithm to determine the exact location of the target spot so that an accurate measurement can be carried out without the interference of the dusty noise around the targets. Some algorithms rely on the assumption that the printing grid is rigid with the cDNA target in the center; others assume an imperfect printing process such that a deformable grid is necessary. The former method is faster and noise insensitive, but may be inaccurate if the slides are fabricated with many displacements; the latter is robust in target position detection, but can be rather slow and noise sensitive. In either case, the simulation outcome will provide a set of evaluation images to assess the tolerance of both algorithmic designs. The slightly misaligned channels also pose a challenge to signal intensity extraction.

2.2.3. Doughnut Hole

Owing to the impact of the print tip on the glass surface, or possibly due to the effect of surface tension during the drying process, a significantly smaller amount of cDNA can be deposited in, or attached to, the center of the targets. Consequently, the center of the target emits fewer fluorescent photons, thereby giving a target the doughnut shape. It is critical for signal intensity extraction whether or not the center hole is assumed, particularly when the signal is weak and there is a large center hole. The simulation allows one hole in the center with varying size, along with a possible off-center displacement. It is not necessary to simulate more than one hole, since the mathematical properties for signal and noise estimation are preserved with this simple condition.

An elliptical shape models the inner core with random horizontal and vertical axes, H and V. The axes are modeled by a normal distribution whose parameters are randomized for each block within a given array: H∼N(μ_H,σ_H) and V∼N(μ_V,σ_V). Interarray variability in these radius distributions is modeled by uniformly distributed means: μ_H∼U(a_H,b_H), σ_H=α₁μ_H and μ_V∼U(a_V,b_V), σ_V=α₂μ_V, where the controlling ratios vary over a range, α₁, α₂∼U(P_a,P_b). The choice of the parameters governs the hole shapes. The center position of a hole is allowed to drift over a range. The shape is unaffected by the drift because the mechanical print tip to surface contact is unaffected. The amount of drift in the horizontal and vertical directions is modeled similarly to spot drift. Drift levels are set at every block, (δc_xR,δc_yR) and (δc_xG,δc_yG), for both channels. The amount of drift is first selected from a uniform range, δc∼U[i,j]. Channel and interchannel drifts are modeled by a uniform variate and set for each block: δc_xG=δcU[−1,1], δc_yG=δcU[−1,1], δc_xR=δc_xG+U[−1,1], and δc_yR=δc_yG+U[−1,1].
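The hole parameters can be drawn as in the sketch below, which follows the distributions listed above. All function and dictionary key names are hypothetical, clamping negative axis lengths to zero is an assumption, and `in_hole` is an illustrative helper for carving the ellipse out of a spot mask.

```python
import random

def sample_hole_block(a_h, b_h, a_v, b_v, p_a, p_b, i, j, rng=None):
    """Draw the per-block hole parameters; argument names mirror the text."""
    rng = rng or random.Random()
    mu_h, mu_v = rng.uniform(a_h, b_h), rng.uniform(a_v, b_v)
    alpha1, alpha2 = rng.uniform(p_a, p_b), rng.uniform(p_a, p_b)
    dc = rng.uniform(i, j)                    # drift magnitude for this block
    dx_g = dc * rng.uniform(-1.0, 1.0)        # green-channel hole drift
    dy_g = dc * rng.uniform(-1.0, 1.0)
    dx_r = dx_g + rng.uniform(-1.0, 1.0)      # red = green plus interchannel offset
    dy_r = dy_g + rng.uniform(-1.0, 1.0)
    return {"mu_h": mu_h, "sigma_h": alpha1 * mu_h,
            "mu_v": mu_v, "sigma_v": alpha2 * mu_v,
            "drift_g": (dx_g, dy_g), "drift_r": (dx_r, dy_r)}

def sample_hole_axes(block, rng=None):
    """Per-spot axes H ~ N(mu_H, sigma_H) and V ~ N(mu_V, sigma_V)."""
    rng = rng or random.Random()
    h = max(0.0, rng.gauss(block["mu_h"], block["sigma_h"]))  # clamp: assumption
    v = max(0.0, rng.gauss(block["mu_v"], block["sigma_v"]))
    return h, v

def in_hole(x, y, h, v, cx=0.0, cy=0.0):
    """True if (x, y) lies inside the ellipse with semi-axes h, v at (cx, cy)."""
    if h <= 0.0 or v <= 0.0:
        return False
    return ((x - cx) / h) ** 2 + ((y - cy) / v) ** 2 <= 1.0
```

For a system that does not produce doughnut holes, setting the axis ranges to zero makes `in_hole` return False everywhere, which nullifies the hole as described in the Introduction.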

2.2.4. Chord Removal

Since parts of a spot can be washed off due to various physical effects during the hybridization and washing stages, pieces of a spot may be missing. We would like to simulate this condition for the same reasons that the center hole is simulated. This irregularity is modeled by randomly cutting chords from the circular spots. The number of chords to be removed, N_c, for a spot is selected from a discrete distribution, {0, 1, 2, 3, 4}, where the elements of the distribution occur with probabilities p₀, p₁, p₂, p₃, and p₄, respectively. For images with very few pieces cut off, the zero-chord probability p₀ is very high, and the three- and four-chord probabilities are close to 0 (possibly equal to 0). To model interarray variability, the probabilities can be treated randomly.

Once the number of chords for a spot is determined, the distance, L, of each chord center to the edge is selected from a beta distribution: L∼B(α_L,β_L). Interblock variability is modeled by allowing α_L and β_L to be randomly selected from uniform distributions: α_L∼U(a_α,b_α), and β_L∼U(a_β,b_β). Owing to the large family of shapes generated by beta distributions, this provides a wide range of distributions for L. Finally, the chord locations are chosen uniformly randomly according to an angle θ∼U(0,2π). Figure 8 illustrates the effect of selecting increased chord rates: (a) p₀=0.70, p₁=0.30; (b) p₀=0.20, p₁=0.40, p₂=0.25, p₃=0.15; (c) p₀=0, p₁=0.10, p₂=0.40, p₃=0.30, p₄=0.20.
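The chord sampling can be sketched as below. The beta variate lies in [0, 1], so interpreting L as a fraction of the spot radius is an assumption, as are the function names. `keeps_point` treats each chord as a straight cut perpendicular to the direction θ.

```python
import math
import random

def sample_chords(probs, alpha_l, beta_l, rng=None):
    """Return a list of (L, theta) chords for one spot.

    probs = (p0, p1, p2, p3, p4) are the probabilities of removing
    0 to 4 chords.
    """
    rng = rng or random.Random()
    n_c = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [(rng.betavariate(alpha_l, beta_l),       # L ~ B(alpha_L, beta_L)
             rng.uniform(0.0, 2.0 * math.pi))        # theta ~ U(0, 2*pi)
            for _ in range(n_c)]

def keeps_point(x, y, radius, chords):
    """True if (x, y), relative to the spot center, survives all chord cuts."""
    if x * x + y * y > radius * radius:
        return False
    for frac, theta in chords:
        # The chord center sits at distance radius * (1 - L) from the center.
        if x * math.cos(theta) + y * math.sin(theta) > radius * (1.0 - frac):
            return False
    return True
```

Interblock variability would be obtained by redrawing `alpha_l` and `beta_l` from their uniform ranges at the start of each block and passing them in.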

Figure 8

Figure shows different chord rate settings for each slide. The probability weights for (0,1,2,3,4) chords were set at the following levels: (a) (0.7,0.3,0,0,0), (b) (0.2,0.4,0.25,0.15,0), (c) (0,0.1,0.4,0.3,0.2). The chord rate is reset at the beginning of each block.

2.2.5. Edge Noise

Owing to the manner in which liquid dries, the spots usually do not have smooth edges. To provide a realistic visual effect, as well as to pose a challenge if edge detection algorithms are under consideration, we simulate this irregular edge effect via parameterized noise using a binary edge-noise algorithm employed in digital document processing.²⁵ After determining the target shape by cutting the center hole, removing possible chords, and possibly creating drift, and prior to simulating the signal intensity, the spot is still in its binary format, and thus the binary edge-noise algorithm can be applied directly. Edge noise is applied to both the outer perimeter of the spot and the inner perimeter containing the hole.

The algorithm begins by first generating a white noise (mask) image having range [0, max intensity]. A 3×3 averaging filter is applied to the white-noise image to arrive at a noise image N that possesses a degree of correlation resembling the noise characteristics of various physical processes, including printing processes. The edge of a binary image can be considered to consist of two parts, inner and outer borders. In our case, the spot radius is known and so are these borders. The inner border is formed by morphologically eroding the image by a 3×3 structuring element and then subtracting the erosion from the original image. The outer border is formed by morphologically dilating the image by a 3×3 structuring element and then subtracting the original image from the dilation. To apply noise to the inner border, a threshold, mid+δ, just above midpoint is applied to N, this binary image is ANDed with the inner border of the original binary spot S, and the result is XORed with S. Noise is applied to the outer border by thresholding N just below the midpoint (mid−δ), complementing, and then ANDing with the outer border of S. This noisy outer border is then ORed with the image possessing inner border noise to yield the edge-degraded binary spot S^′. The process is mathematically described by

Eq. (1)

S^{'} = [(N_{mid + δ} \cap S_{in}) Δ S] \cup [{(N_{mid - δ})}^{c} \cap S_{out}],

where δ controls the threshold and hence the amount of edge noise, and Δ denotes the symmetric difference. S^′ is a binary mask giving the spatial domain of the spot. Figure 9 shows edge noise for various δ thresholds.
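Eq. (1) can be sketched on a small binary image as follows. This is an illustrative pure-Python version, not the cited document-processing implementation. It assumes noise values in [0, 1] with mid=0.5, so δ is a proportion of the maximum intensity; `make_disk` is a hypothetical helper used only to build a test spot.

```python
import random

def make_disk(size, radius):
    """Binary disk centered in a size x size image (test helper)."""
    c = size // 2
    return [[int((x - c) ** 2 + (y - c) ** 2 <= radius ** 2)
             for x in range(size)] for y in range(size)]

def _window(img, y, x):
    """3x3 neighborhood of (y, x); pixels outside the image count as 0."""
    h, w = len(img), len(img[0])
    return [img[j][i] if 0 <= j < h and 0 <= i < w else 0
            for j in (y - 1, y, y + 1) for i in (x - 1, x, x + 1)]

def edge_noise(spot, delta, mid=0.5, rng=None):
    """Apply Eq. (1): S' = [(N_{mid+d} AND S_in) XOR S] OR [NOT(N_{mid-d}) AND S_out]."""
    rng = rng or random.Random()
    h, w = len(spot), len(spot[0])
    # White noise with a one-pixel margin, smoothed by a 3x3 averaging filter.
    white = [[rng.random() for _ in range(w + 2)] for _ in range(h + 2)]
    noise = [[sum(white[y + j][x + i] for j in range(3) for i in range(3)) / 9.0
              for x in range(w)] for y in range(h)]
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            win = _window(spot, y, x)
            s = bool(spot[y][x])
            s_in = s and not all(win)      # spot minus its 3x3 erosion
            s_out = any(win) and not s     # 3x3 dilation minus the spot
            inner = ((noise[y][x] > mid + delta) and s_in) != s   # XOR with S
            outer = (not (noise[y][x] > mid - delta)) and s_out
            row.append(int(inner or outer))
        out.append(row)
    return out
```

Because the smoothed noise stays within [0, 1], any δ above 0.5 leaves the spot unchanged, while δ near 0 flips roughly half of the border pixels.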

Figure 9

Figure shows the edge noise on the spots. The noise-controlling parameter δ can be set in the range [0,1.0], where δ is a proportion of the maximum intensity. The example shows increasing edge noise: (a) δ=0.25, (b) δ=0.1, (c) δ=0.03.

2.2.6. Signal Intensity

Simulation of signal intensity is divided into three steps. First, it is assumed that the fluor-tagged mRNAs cohybridized to a single slide are from the same cell type, and therefore the signals from the two fluorescent channels are supposed to be identical, with some variation. Second, some percentage of genes may be selected as significantly over- or underexpressed. Third, foreground noise is added to the entire array to simulate the normal scanning integration process.

It is well known that the distribution of gene expression levels within a cell closely follows an exponential distribution.²⁶ Given a microarray containing N genes, the intensity levels I_k, for k=1,…,N, assumed to be related to the expression levels of N genes, are simulated by an exponential distribution. This intensity level I_k is considered to be the ground-truth signal that is not directly measurable from the microarray, since biological and biochemical processes, from mRNA extraction up to hybridization, introduce some variation into the measurement of the final fluorescent signal strength. For each microarray, a particular exponential distribution with mean β is first chosen (for a detection system with gray levels up to 65 535, β is usually selected around 3000). Then at each spot location, which we assume to represent one unique gene, one ground-truth signal level I_k is generated from the exponential distribution. For two observable measurements (R_k,G_k) from two fluorescent channels, two numbers are generated from a normal distribution with mean of I_k and standard deviation of αI_k, where α is a predetermined coefficient of variation, which is usually about 5%–30% depending on the assumed biological relation between the two channels.
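The ground-truth and channel draws can be sketched as follows. The function name is hypothetical; the defaults β=3000 and α=0.15 are taken from the typical values mentioned in the text and in the Figure 11 caption. Note that `random.expovariate` takes the rate 1/β rather than the mean.

```python
import random

def sample_channel_pairs(n_genes, beta=3000.0, alpha=0.15, rng=None):
    """Return a list of (I_k, R_k, G_k) triples, one per gene."""
    rng = rng or random.Random()
    triples = []
    for _ in range(n_genes):
        i_k = rng.expovariate(1.0 / beta)     # ground truth, mean beta
        r_k = rng.gauss(i_k, alpha * i_k)     # R_k ~ N(I_k, alpha * I_k)
        g_k = rng.gauss(i_k, alpha * i_k)     # G_k ~ N(I_k, alpha * I_k)
        triples.append((i_k, r_k, g_k))
    return triples
```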

To include outlier expression levels that reflect certain realistic conditions,³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ¹⁰ ¹⁴ one may select 5%–10% of the spots to be either over- or underexpressed. This condition is achieved by selecting the genes from the entire microarray based on a probability, p_outlier (e.g., p_outlier=0.05 for 5% outliers), and then selecting the targeted expression ratio for the kth gene

Eq. (2)

t_{k} = 10^{\pm b_{k}},

where b_k satisfies a beta distribution, b_k∼B(1.7,4.8), and where the +/− sign is selected with equal probability. Upon obtaining a targeted expression ratio, the algorithm converts the expression intensities from the two fluorescence channels by

Eq. (3)

R_{k}^{'} = R_{k} \sqrt{t_{k}},

G_{k}^{'} = \frac{G_{k}}{\sqrt{t_{k}}},

where R_k^′ and G_k^′ denote the signal values after the conversion.
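Eqs. (2) and (3) can be sketched together as below; `apply_outlier` is a hypothetical name. Splitting the ratio as √t_k on each channel multiplies R/G by t_k while leaving the product RG unchanged, which the function preserves.

```python
import math
import random

def apply_outlier(r_k, g_k, p_outlier=0.05, rng=None):
    """Return (R'_k, G'_k, t_k); t_k is 1.0 when the gene is not an outlier."""
    rng = rng or random.Random()
    if rng.random() >= p_outlier:
        return r_k, g_k, 1.0
    b_k = rng.betavariate(1.7, 4.8)           # b_k ~ B(1.7, 4.8), in [0, 1]
    sign = rng.choice((-1.0, 1.0))            # over- or underexpressed
    t_k = 10.0 ** (sign * b_k)                # Eq. (2), so 0.1 <= t_k <= 10
    root = math.sqrt(t_k)
    return r_k * root, g_k / root, t_k        # Eq. (3)
```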

Upon obtaining the signal intensities for each spot, (R_k^′,G_k^′), each pixel within the spot binary mask derived from steps 2.2.1 to 2.2.5 is filled with the signal intensity. Normally distributed foreground noise is then added pixel-wise. This yields, at each pixel, the intensities SR=R_k^′+I_f1 and SG=G_k^′+I_f2, where I_f1∼N(μ_{R_k},σ_{R_k}²), I_f2∼N(μ_{G_k},σ_{G_k}²) and μ_{R_k}∼R_k^′U[f_a₁,f_b₁], σ_{R_k}∼μ_{R_k}U[f_c₁,f_d₁], μ_{G_k}∼G_k^′U[f_a₂,f_b₂], and σ_{G_k}∼μ_{G_k}U[f_c₂,f_d₂]. In the remainder of the paper, α’s are used to denote the uniform variables α_m₁∼U[f_a₁,f_b₁], α_m₂∼U[f_a₂,f_b₂], α_s₁∼U[f_c₁,f_d₁], and α_s₂∼U[f_c₂,f_d₂].
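The pixel filling for one channel can be sketched as below. The text does not state whether the uniform factors are drawn once per spot or once per pixel; this sketch assumes once per spot, and the name `fill_spot` is hypothetical. Pixels outside the mask are returned as 0.0 so that the spot layer can be added to a separately simulated background.

```python
import random

def fill_spot(mask, level, f_a, f_b, f_c, f_d, rng=None):
    """Fill a binary spot mask with `level` plus normal foreground noise."""
    rng = rng or random.Random()
    mu = level * rng.uniform(f_a, f_b)        # noise mean, alpha_m * level
    sigma = abs(mu) * rng.uniform(f_c, f_d)   # noise deviation, alpha_s * mu
    return [[level + rng.gauss(mu, sigma) if m else 0.0 for m in row]
            for row in mask]
```

The same function would be called twice per spot, once with the red-channel level and ranges and once with the green-channel ones.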

2.2.7. Channel Conditioning

Owing to various reasons, such as imprecise quantities of starting mRNA for the two channels, different labeling efficiencies, or uneven laser powers at the scanning stage, in actual microarray experiments there may not be equal intensities even if two channels use exactly the same labeled mRNA. Moreover, one may not be able to assume that the fluorescent intensity is linearly related to the expression level. In fact, it is very difficult to determine the exact form of the response function from expression level to intensity due to the complex combination of bio-chemistry to photon electronics. We choose a family of functions that covers most of the understandable conditions, shown in Figure 10, such as delayed response, saturation (which is an embedded feature in the digital system since no gray level can pass 16-bit binary digits in a typical microarray system), and unbalanced channel intensity. This simulation is intended to facilitate understanding as to what is the best way for expression ratio normalization, whether linear based methods will be sufficient or nonlinear based methods will be necessary. The function family is characterized by four parameters, (a₀,a₁,a₂,a₃), and the function form is given by

Eq. (4)

f (x) = a_{3} [a_{0} + x {(1 - e^{- x / a_{1}})}^{a_{2}}]; a_{3} > 1 .

Having chosen a function from the family, the expression levels, R^′ and G^′, from each fluorescent detection channel are then transformed by the detection system response characteristic function defined by f_R(x) or f_G(x) to obtain the realistic fluorescent intensity observed. The observed fluorescent intensities are

Eq. (5)

R_{k}^{″} = f_{R} (R_{k}^{'}),

G_{k}^{″} = f_{G} (G_{k}^{'}),

where f_R or f_G may take different parameters for each fluor-tagging system. The simulation performs the following steps for signal placement to emulate the real process affecting the signal spots.

Figure 10

Fluorescent detection response characteristic functions. In all figures, the middle (blue) curve is the reference function with parameters (a₀,a₁,a₂,a₃)=(0,100,−1,1). Also, in all figures, the x axis is the input signal intensity and the y axis is the observed signal intensity, both in log₁₀ scale. (a) Delayed response at various levels, with fixed a₀=0 and a₃=1. (b) Different amplification levels, with fixed a₀=0 and a₂=−1. (c) Different response curvature, with fixed a₀=0 and a₃=1. (d) Some other parameter settings, with fixed a₃=1.

(1) Generate ground truth expression signal I_k (k=1,…,N) for every gene by exponential distribution (see Sec. 2.2.6).
(2) Let R_k∼N(I_k,αI_k) and G_k∼N(I_k,αI_k). If a self–self experiment needs to be simulated, skip steps 3 and 4.
(3) If we simulate an experiment with two different samples, some outlier genes are selected and then their intensities are altered. We obtain (R^′,G^′) from (R,G) for all genes [see Sec. 2.2.6, and Eqs. (2) and (3)].
(4) If we simulate a fluorescent system with imperfect response characteristics, the intensities are further converted by R^″=f_R(R^′) and G^″=f_G(G^′) (see Sec. 2.2.7).
(5) The actual simulated fluorescent intensities for both channels are obtained by applying additional variation via a normal distribution function SR=R^″+N(μ_R,σ_R ²), where μ_R=α_m1R^″, σ_R=α_s1μ_R, and similarly for signal G (see Sec. 2.2.6).
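The five steps above, together with the response function of Eq. (4), can be sketched for a single gene as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, inputs to the response function are clamped to a small positive value so that the power with negative a₂ is defined, and the factors α_m and α_s are passed as fixed numbers instead of being drawn from their uniform ranges.

```python
import math
import random

def response(x, a0, a1, a2, a3):
    """Eq. (4): f(x) = a3 * [a0 + x * (1 - exp(-x / a1)) ** a2]."""
    x = max(x, 1e-9)                     # clamp non-positive inputs (assumption)
    return a3 * (a0 + x * (-math.expm1(-x / a1)) ** a2)

def simulate_gene(beta, alpha, f_r, f_g, alpha_m, alpha_s, t_k=None, rng=None):
    """Return (I_k, SR, SG) for one gene; f_r and f_g are (a0, a1, a2, a3)."""
    rng = rng or random.Random()
    i_k = rng.expovariate(1.0 / beta)                  # step 1: ground truth
    r = rng.gauss(i_k, alpha * i_k)                    # step 2: channel values
    g = rng.gauss(i_k, alpha * i_k)
    if t_k is not None:                                # step 3: outlier ratio
        r, g = r * math.sqrt(t_k), g / math.sqrt(t_k)
    r, g = response(r, *f_r), response(g, *f_g)        # step 4: detector response

    def observe(v):                                    # step 5: added variation
        mu = alpha_m * v
        return v + rng.gauss(mu, alpha_s * mu)

    return i_k, observe(r), observe(g)
```

With the parameters (0,1,−1,1) the response is effectively the identity for intensities well above 1, which corresponds to the unaltered case of Figure 11(a).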

The scatter plots in Figure 11 show the effects of the channel normalization. By choosing different parameter sets, one can simulate many of the situations observed in real microarray images.

Figure 11

Possible scatter plots due to various response conversions for different fluorescent channels. 10 000 data points (gene expression levels) were generated by the exponential distribution with a mean of 3000. After passing through two fluorescent channels [with the response characteristic functions shown in parts (a)–(c)], data variations were added by passing each data point through a normal distribution with a standard deviation of 15% of the mean expression signal. (a) Without any alteration [or equivalently, with response function parameters (a₀,a₁,a₂,a₃)=(0,1,−1,1)], assuming the signal intensities from the red channel and green channel are equivalent (a simulated self–self experiment). (b) Banana shape. The intensity in the green channel passes through a response function with parameters (a₀,a₁,a₂,a₃)=(0,500,−1,1), whereas the red channel takes the parameters (0,10,−1,1). (c) Sinusoid shape. The red channel’s response function has parameters (0,100^(1/0.7),−0.7,1), and the green channel’s has (0,100^(1/0.9),−0.9,1).

2.2.8. Edge Enhancement

Under some fabrication conditions, such as incorrect humidity control, where the cDNA solution tends to accumulate towards the outer edge during the drying process, the spot edge may appear brighter than the rest of the spot. This phenomenon is modeled by randomly enhancing the edge. The number, N_e, of pixels from the edge to be enhanced is fixed. The enhancement, W_ed, is added to the original intensity. W_ed satisfies a normal distribution, W_ed∼N(μ_e,1). Randomness between blocks is modeled by making μ_e uniformly distributed, μ_e∼U(l_a,l_b).
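A minimal sketch of this edge enhancement, assuming a circular spot on an image stored as a list of rows, is given here. The function name `enhance_edge` and the choice to treat the N_e outermost pixels as the ring between radius−N_e and radius are assumptions made for illustration; the block-level mean μ_e would be drawn beforehand from U(l_a,l_b).

```python
import math
import random


def enhance_edge(image, cx, cy, radius, n_e, mu_e, rng=None):
    """Add W_ed ~ N(mu_e, 1) to pixels within n_e pixels of the spot edge."""
    rng = rng or random.Random()
    out = [row[:] for row in image]  # leave the input untouched
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            d = math.hypot(x - cx, y - cy)  # distance from the spot center
            if radius - n_e < d <= radius:
                out[y][x] = value + rng.gauss(mu_e, 1.0)
    return out


# Block-level randomness: one mu_e per block, drawn from U(l_a, l_b).
block_rng = random.Random(1)
mu_e = block_rng.uniform(10.0, 40.0)
spot = [[100.0] * 21 for _ in range(21)]
enhanced = enhance_edge(spot, cx=10, cy=10, radius=8, n_e=3, mu_e=mu_e, rng=block_rng)
```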

2.3. Postprocessing Simulation

Most postprocessing steps simulate handling and scanning artifacts: scratch noise resulting from improper handling of microarray slides, spike noise arising from the impurity of mRNA extraction steps or perhaps insufficient washing conditions, snake noise due to the accumulation of dust if the slides have sat in open space too long, and last, but not least, smoothing resulting from many scanners’ averaging effects or integration processes. For the most part, these steps model the interaction between signal and noise in the spatial domain, which causes pixel-wise nonlinear degradation. It is expected that the microarray image analysis software shall be able to handle most of the noise conditions outlined here in order to measure the signal precisely.

2.3.1. Spike Noise

In a practical biology laboratory, it is not necessary to maintain a dust-free environment. Hence, fine microscopic dust particles are nearly impossible to avoid. On laser excitation, these particles fluoresce to give high intensity spikes. Moreover, in some cases, bad mixtures of cDNA solutions result in precipitation, and these particles fluoresce with a very high intensity. These effects are simulated by adding spike noise at a preset rate. Such intensity spikes are added randomly across the entire slide area. The amount of spike noise in an array is set by the percentage, L_spi, of the total number of pixels in the array; typical low to high noise levels are obtained by selecting values from 0.1% to 10%. Once a pixel is selected for spike noise, the adjacent pixels have a higher probability of being affected. Thus, a random number, W_spi, of pixels are chosen in an arbitrary direction to be influenced by this noise. The intensity, N_S, of the spike noise is governed by an exponential distribution with mean μ_spi. In Figure 12, the exponential mean is fixed but the spike level is increased through the parts of the figure.
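The spike-noise procedure can be sketched as follows. This is a hedged illustration rather than the published implementation: the function name `add_spike_noise`, the eight-neighbor direction set, and the interpretation of L_spi as the percentage of pixels used as spike seeds (each seed then spreading over W_spi pixels) are assumptions.

```python
import random

# Eight candidate directions for spreading a spike to adjacent pixels (assumption).
DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (1, -1), (-1, 1), (-1, -1)]


def add_spike_noise(image, l_spi, mu_spi, w_range=(1, 3), rng=None):
    """Add exponential spikes; l_spi is a percentage of the total pixel count."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    n_seeds = int(round(height * width * l_spi / 100.0))
    for _ in range(n_seeds):
        x, y = rng.randrange(width), rng.randrange(height)
        dx, dy = rng.choice(DIRECTIONS)          # arbitrary spreading direction
        for _ in range(rng.randint(*w_range)):   # W_spi ~ U[g, h] pixels
            if 0 <= x < width and 0 <= y < height:
                out[y][x] += rng.expovariate(1.0 / mu_spi)  # N_s ~ Exp(mean mu_spi)
            x, y = x + dx, y + dy
    return out
```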

Figure 12

Spike noise at increasing levels L_spi: (a) level of 0.1, (b) level of 5, (c) level of 10. The range of the exponential rate is held fixed.

2.3.2. Scratch Noise

Physical handling of the array slides can result in surface scratches. These typically result in low intensity levels. Scratch-noise intensity is parameterized as a ratio, κ_sc, giving the background-to-scratch-noise intensity level. Other parameters are the number of strips, strip thickness W_sc, and a random strip length, L_sc, given as a multiple of the spot size. The latter is modeled as a uniform distribution: L_sc∼U[L_sc1,L_sc2]. Strips are placed at random positions on the array, and are inclined according to a (discrete) uniformly random angle, θ_sc∈{0°,45°,90°,135°,180°}. Figure 13 shows the noise for incremental parameter settings: (a) L_sc∼U[2,7], κ_sc=2.0, W_sc=four pixels; (b) L_sc∼U[5,10], κ_sc=3.0, W_sc=seven pixels; (c) L_sc∼U[7,15], κ_sc=4.0, W_sc=ten pixels. The number of strips is fixed at 7.
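A rough sketch of scratch placement under these parameters is shown here. The function name `add_scratches` is hypothetical, and two choices are assumptions: scratch pixels replace the underlying value, and the intensity follows I_sc∼N(μ_sc,σ_sc) with μ_sc=μ_b/κ_sc and σ_sc=k_scμ_sc as listed in Table 1.

```python
import math
import random

ANGLES_DEG = (0, 45, 90, 135, 180)  # discrete inclinations theta_sc


def add_scratches(image, n_sc, kappa_sc, w_sc, l_range, spot_size, mu_b,
                  k_sc=0.1, rng=None):
    """Draw n_sc low-intensity strips of thickness w_sc at random positions."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    mu_sc = mu_b / kappa_sc  # background-to-scratch intensity ratio
    for _ in range(n_sc):
        theta = math.radians(rng.choice(ANGLES_DEG))
        length = int(rng.uniform(*l_range) * spot_size)  # L_sc in spot sizes
        x0, y0 = rng.randrange(width), rng.randrange(height)
        for step in range(length):
            cx = x0 + step * math.cos(theta)
            cy = y0 + step * math.sin(theta)
            for offset in range(w_sc):  # thicken perpendicular to the strip
                px = int(round(cx - offset * math.sin(theta)))
                py = int(round(cy + offset * math.cos(theta)))
                if 0 <= px < width and 0 <= py < height:
                    out[py][px] = max(rng.gauss(mu_sc, k_sc * mu_sc), 0.0)
    return out
```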

Figure 13

Scratch noise at different parameter settings. The number of scratches is held at 7. The parameters are (a) L_sc∼U[2,7], κ_sc=1.5, W_sc=3 pixels; (b) L_sc∼U[5,15], κ_sc=2.5, W_sc=7 pixels; (c) L_sc∼U[8,45], κ_sc=4.0, W_sc=15 pixels. The noise factor is k_sc=0.1.

2.3.3. Snake Noise

Fine fabric dust particles on the slides can create snake-tailed strips on laser excitation. These strips are normally higher intensity than the signal level. To simulate this noise, an equiprobable multidirectional snake noise has been generated consisting of some number, N_seg, of segments. Analogously to scratch noise, the intensity is parameterized as a ratio, κ_sn, giving the average-signal-to-snake-noise intensity level, the number of snakes, snake thickness W_sn, and a random length, L_sn, given as a multiple of the spot size. The latter is modeled as a uniform distribution: L_sn∼U[L_sn1,L_sn2]. Figure 14 shows the noise for incremental parameter settings: (a) N_seg=5, L_sn∼U[5,10], κ_sn=0.50, W_sn=two pixels; (b) N_seg=10, L_sn∼U[5,30], κ_sn=0.33, W_sn=three pixels; (c) N_seg=15, L_sn∼U[15,80], κ_sn=0.25, W_sn=five pixels.
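One way to picture a single snake is as a chain of N_seg segments, each taking a fresh direction with equal probability. The sketch below follows that reading; the name `add_snake`, the square brush of side W_sn, and the rule that a snake pixel only ever brightens the image are assumptions, while μ_sn=I_k/κ_sn and σ_sn=k_snμ_sn follow Table 1 with the average signal standing in for I_k.

```python
import random

# Eight equiprobable directions for each segment (assumption).
DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (1, -1), (-1, 1), (-1, -1)]


def add_snake(image, n_seg, kappa_sn, w_sn, l_range, spot_size, mean_signal,
              k_sn=0.1, rng=None):
    """Draw one snake made of n_seg randomly oriented segments."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    mu_sn = mean_signal / kappa_sn  # kappa_sn < 1 gives noise above signal level
    x, y = rng.randrange(width), rng.randrange(height)
    for _ in range(n_seg):
        dx, dy = rng.choice(DIRECTIONS)
        length = int(rng.uniform(*l_range) * spot_size)  # L_sn in spot sizes
        for _ in range(length):
            for ox in range(w_sn):       # square brush of side w_sn
                for oy in range(w_sn):
                    px, py = x + ox, y + oy
                    if 0 <= px < width and 0 <= py < height:
                        noise = rng.gauss(mu_sn, k_sn * mu_sn)
                        out[py][px] = max(out[py][px], noise)
            x, y = x + dx, y + dy
    return out
```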

Figure 14

Different parameter settings for snake noise: (a) N_seg=5, L_sn∼U[5,10], κ_sn=0.5, W_sn=2 pixels; (b) N_seg=10, L_sn∼U[5,30], κ_sn=0.33, W_sn=3 pixels; (c) N_seg=15, L_sn∼U[5,80], κ_sn=0.25, W_sn=5 pixels. The direction of the tail was randomly chosen with equal probability for each.

2.3.4. Smoothing Function

Addition of various noise types makes the microarray highly peaked with high pixel differences. This stark irregularity can be mitigated by smoothing the image with either a flat or pyramidal convolution kernel. The kernels are shown in Figure 15. The effect of smoothing is illustrated in Figure 16, where the three-dimensional (3D) profile of an originally noised image is shown, along with versions smoothed by flat and pyramidal kernels. Either smoothing kernel can be chosen.
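Smoothing with a 3×3 kernel can be sketched as a plain convolution. The kernel weights are not stated in the text, so both are assumptions: a uniform kernel for the flat case and a 1-2-1 weighting for the pyramidal case, each normalized by its sum. The function name `smooth` is hypothetical, and border pixels are simply left unchanged.

```python
FLAT = [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]]

PYRAMID = [[1, 2, 1],
           [2, 4, 2],
           [1, 2, 1]]


def smooth(image, kernel):
    """Convolve the interior of the image with a normalized 3x3 kernel."""
    height, width = len(image), len(image[0])
    total = sum(sum(row) for row in kernel)
    out = [row[:] for row in image]  # border pixels keep their original values
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    acc += kernel[ky][kx] * image[y + ky - 1][x + kx - 1]
            out[y][x] = acc / total
    return out


# A single spike of height 90 in a 5x5 image is spread over its neighbors.
spike = [[0.0] * 5 for _ in range(5)]
spike[2][2] = 90.0
flat_result = smooth(spike, FLAT)        # center becomes 90 / 9
pyramid_result = smooth(spike, PYRAMID)  # center becomes 90 * 4 / 16
```

The pyramidal kernel keeps more of the central value than the flat one, so peaks are attenuated less.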

Figure 15

Example shows the 3×3 convolution kernel for (a) flat function and (b) pyramidal function.

Figure 16

3D profile before and after smoothing: (a) noised image, (b) smoothed with the flat function, (c) smoothed with the pyramidal function.

2.4. Image Generation and Parameter I/O

Parameters governing the effects described in the preceding sections form the input (through a file) to the synthetic array software. These include parameters for array dimensions, shape parameters, and noise processes. All relevant information, such as spot size, position, various drifts (center hole, spot), noise processes (foreground, spike, snake, scratch, etc.), and chord rate, is recorded for every spot printed on the synthetic array. Block controlling parameters and the array information are also recorded. The recorded information contains the true signal for the synthetic microarray, which can be used subsequently to evaluate various signal processing tools.

The TIFF format is widely used owing to its platform independence and flexibility of data representation. The synthetic images are generated in TIFF with a sample (pixel) resolution of two bytes for every color (R, G). Both monochrome and color images (R and G as two blocks, and interlaced R, G with a dummy B) are generated. Standard freeware routines (http://www.libtiff.org) are used to generate these formats. The image file is written in blocks, where the size of the block (commonly called a “strip”) is set equal to the image width. The image data are written in the native byte order (big-endian or little-endian) of the host CPU on which the library is compiled. Image data quality is maintained by disabling compression and other special options available in these routines and formats.

2.5. Summary of Model Parameters

The cDNA microarray printing process can be categorized and grouped into independent events. Each event is probabilistically described by assigning a distribution, as previously described. Due to the physical nature of the process, there exist variations between events. This variation is described by randomization of the controlling parameters (second level randomization). The parameter randomization can be broadly grouped as (i) randomization at spot level, (ii) randomization at block level, and (iii) randomization at array level. The parameters are grouped and mathematically described in Table 1.

Table 1

Parameter settings for the cDNA microarray simulation.
Level	Simulation	Parameter descriptions	Distribution
SPOT	Spot size	S: Spot radius with (μ_s,σ_s ²)	S∼N(μ_s,σ_s ²)
	Spot drift	δ_x,δ_y: Drifting level	δ_x,δ_y∼U(d_a,d_b)
		d_a,d_b: percentage of spot radius
		P_D: Drift activation probability	D_x=δ_xSU[−1,1]
		D_x,D_y: Relative drifting	D_y=δ_ySU[−1,1]
		(X₁ ^′,Y₁ ^′): Drifted center coordinates	X₁ ^′=X+D_x, Y₁ ^′=Y+D_y; X₂ ^′=X₁ ^′+U[−1,1], Y₂ ^′=Y₁ ^′+U[−1,1]
		(X₂ ^′,Y₂ ^′): Second channel, where (X,Y) is predefined spot center coordinates
	Inner hole size	H, V: Horizontal and vertical axis of the inner elliptical hole	H∼N(μ_H,σ_H) V∼N(μ_V,σ_V)
	Inner hole drift	X_C,Y_C: Ideal spot center	X_R=X_C+δc_xR
		X_R,Y_R: First channel coordinates	Y_R=Y_C+δc_yR
		X_G,Y_G: Second channel coordinates where	X_G=X_C+δc_xG
		δc_xG,δc_yG,δc_xR,δc_yR: drift level set at the block level	Y_G=Y_C+δc_yG
	Chord removal	P_{N_c}: Chord removal probability ( p_k: probability of k chords to be removed from a target spot)	P_{N_c}={p₀,p₁,p₂,p₃,p₄}, where p₀+p₁+p₂+p₃+p₄=1 N_c∼{0,1,2,3,4}
		L: Chord length	L∼B(α_L,β_L)
		θ: Chord position	θ∼U(0,2π)
	Spot intensity	β: Mean intensity for the assumed cell system	I_k∼Exp(β)
		R_k,G_k:kth spot (fixed) signal intensities for both channels	R_k∼N(I_k,σ_I) G_k∼N(I_k,σ_I)
		α: Coefficient of variation of signal intensity in the system	σ_I=αI_k
	Outlier’s intensity	p_outlier: Outlier activation probability
		b_k: Outlier control level	b_k∼Beta(1.7,4.8)
		t_k: Targeted outlier expression ratio, with equal probability for the +/− sign	t_k=10^{±b_k}
		R_k ^′,G_k ^′: kth outlier signal intensities for both channels	$R_{k}^{'} = R_{k} \sqrt{t_{k}}$ $G_{k}^{'} = G_{k} / \sqrt{t_{k}}$
	Channel conditioning	R_k ^″,G_k ^″: Prenormalized signal intensity of the spots on red, green channels	R_k ^″=f₁(R_k ^′) G_k ^″=f₂(G_k ^′)
		a₀,a₁,a₂, and a₃, parameters for response characteristic function.	f(x)=[a₀+x(1−e^−x/a₁)^a₂]a₃; where a₃>1
	Spot signal variation—foreground noise	SR_k,SG_k: Pixel-wise (x,y) signal intensity	SR_k(x,y)∼R_k ^″+N(μ_{R_k ^″},σ_R ²) SG_k(x,y)∼G_k ^″+N(μ_{G_k ^″},σ_G ²)
		α_s: Within spot signal coefficient of variation	μ_{R_k ^″}=α_{m_1}R_k ^″, α_{m_1}∼U[f_{a_1},f_{b_1}]; μ_{G_k ^″}=α_{m_2}G_k ^″, α_{m_2}∼U[f_{a_2},f_{b_2}]
			σ_R=α_{s_1}μ_{R_k ^″}, α_{s_1}∼U[f_{c_1},f_{d_1}]; σ_G=α_{s_2}μ_{G_k ^″}, α_{s_2}∼U[f_{c_2},f_{d_2}]
	Edge enhancement	W_ed: Level of enhancement, parameter (μ_e) set for the block	W_ed∼N(μ_e,1)
		N_e: Number of pixels enhanced
	Edge noise	Apply edge noise at the set level (δ_ed)
BLOCK	Radius parameters	μ_s,k_s: mean and radius deviation factor	μ_s∼U(s_a,s_b) σ_s∼k_sμ_s
		s_a,s_b: bounds of radius, set by block size and inter spot gap
	Chord parameters	N_c: Chord rate picked with equal probability	N_c∈U{0,1,2,3,4} having weights {p₀,p₁,p₂,p₃,p₄}
		α_L,β_L: Chord distributional parameters	α_L∼U(a_α,b_α),β_L∼U(a_β,b_β),
	Inner hole parameters	μ_H,μ_V,σ_H,σ_V: Parameters for inner elliptical hole	μ_H∼U(L_a,L_b)μ_s, μ_V∼U(L_a,L_b)μ_s
		μ_s: Mean spot radius in the block	σ_H=α₁μ_s,σ_V=α₂μ_s
			α₁∼U(P_a,P_b),α₂∼U(P_c,P_d)
	Drift parameters	δc_xG,δc_yG,δc_xR,δc_yR: drift level	δc∼U[i,j]
		i, j: Percentage of the spot radius	δc_xG=δcU[−1,1],δc_yG=δcU[−1,1]
			δc_xR=δc_xG+U[−1,1],δc_yR=δc_yG+U[−1,1]
	Enhancement	l_a,l_b: Range of intensity ratio. Set mean level of enhancement for a block	μ_e∼U(l_a,l_b)
ARRAY	Physical dimensions	B_w,B_h: Block size—width, height (distance between first spot centers of any two block)	Typical Setting for a 8 blocks, 2 row array (in pixels):
		M_l,M_r,M_t,M_b: Margin settings (left, right, top, bottom)	B_h,B_w=900 M_l,M_r,M_t,M_b=100
		N_pin,N_row: Number of pins in an array, printed equally across N_row number of rows
		NS_w,NS_h: Number of spots along the width (NS_w) and height (NS_h) of the block
	Signal to noise ratio	SNR: Signal to noise level is set for an array
	Interspot distance	G_sp: Interspot distance, set for an array
	Background	I_{b_ch1},I_{b_ch2}: Background intensity, with parameters set for an array	I_{b_ch1}∼N(μ_b,σ_b₁ ²) I_{b_ch2}∼N(μ_b,σ_b₂ ²)
		γ: Background level Parameter settings:	γ∼U[a,b]
		—Flat fluorescent background	μ_b=γ,
		—Functional background g(x,y): choice of parabolic, positive or negative slant surface function	μ_b=γg(x,y), with, σ_b₁=(k_b₁μ_b),σ_b₂=(k_b₂μ_b)
	Spike noise	L_spi: Level of spike noise (set in terms of percentage of total pixels)
		N_s: Intensity of the spike noise	N_s∼Exp(μ_spi),
		μ_spi: Noise rate	μ_spi∼U[e,f]
		W_spi: Width of the noise cluster	W_spi∼U[g,h]
	Edge noise	δ_ed: Set the controlling parameter	δ_ed set as a percentage of maximum intensity value
	Snake noise	N_seg: Number of snake tails in an image	N_seg,κ_sn,L_sn,W_sn
		I_sn: Intensity of the noise tail	I_sn∼N(μ_sn,σ_sn),
		κ_sn: Average signal-to-snake-noise intensity level	μ_sn=(I_k/κ_sn),σ_sn=k_snμ_sn
		L_sn: Length of the segment expressed as multiples of average spot size	L_sn∼U[L_sn1,L_sn2]
		W_sn: Width of the snake noise tail
	Scratch noise	N_sc: Number of scratch tails in an image	N_sc,κ_sc,W_sc, θ
		I_sc: Intensity of the scratch noise	I_sc∼N(μ_sc,σ_sc)
		κ_sc: Average background-to- scratch-noise intensity level	μ_sc=(μ_b/κ_sc),σ_sc=k_scμ_sc
		L_sc: Length of the segment in units of average size of the spots	L_sc∼U[L_sc1,L_sc2]
		W_sc: Width of the scratch noise	θ∈U{0°,45°,90°,135°,180°}
		θ: Scratch noise inclination

Each noise type is categorized into one of the three groups and individually parameterized. Some are related to another noise parameter; others are independent. Each noise parameter is assigned a statistical distribution fitting its nature. For instance, consider spot radius. Spot radius obeys a normal distribution N(μ_s,σ_s ²), where the mean spot radius (μ_s) is randomly picked over a small range (s_a,s_b) at the block level. This spot size range is set for an array depending on a user setting: the number of spots in a block (NS_w,NS_h) at the array level. If a noise type needs to be suppressed, then the corresponding parameters can be set small to nullify its effect. For example, the inner spot hole follows a normal distribution along its horizontal (μ_H,σ_H) and vertical (μ_V,σ_V) axes. Its parameters are randomly picked from a preset range (L_a,L_b) and related to the mean spot radius (μ_s) at the block level [μ_H∼U(L_a,L_b)μ_s, μ_V∼U(L_a,L_b)μ_s]. For small or negligible doughnut holes, this preset range can be set small, or even null for perfect spots. The table can be read from the spot level to the array level, following the corresponding parameters through each level as in these examples.
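The two-level randomization described here (block-level parameters drawn first, then spot-level values drawn from them) can be sketched as below. The function names and the dictionary layout are hypothetical; the distributions follow Table 1, with radii and hole axes clipped at zero as an added assumption.

```python
import random


def sample_block(s_a, s_b, k_s, hole_range, hole_cv_range, rng):
    """Block level: mu_s ~ U(s_a, s_b), sigma_s = k_s*mu_s, and hole parameters."""
    mu_s = rng.uniform(s_a, s_b)
    return {
        "mu_s": mu_s,
        "sigma_s": k_s * mu_s,
        "mu_h": rng.uniform(*hole_range) * mu_s,        # mu_H ~ U(L_a, L_b) * mu_s
        "mu_v": rng.uniform(*hole_range) * mu_s,        # mu_V ~ U(L_a, L_b) * mu_s
        "sigma_h": rng.uniform(*hole_cv_range) * mu_s,  # sigma_H = alpha_1 * mu_s
        "sigma_v": rng.uniform(*hole_cv_range) * mu_s,  # sigma_V = alpha_2 * mu_s
    }


def sample_spot(block, rng):
    """Spot level: S ~ N(mu_s, sigma_s^2), H ~ N(mu_H, sigma_H), V ~ N(mu_V, sigma_V)."""
    return {
        "radius": max(rng.gauss(block["mu_s"], block["sigma_s"]), 0.0),
        "hole_h": max(rng.gauss(block["mu_h"], block["sigma_h"]), 0.0),
        "hole_v": max(rng.gauss(block["mu_v"], block["sigma_v"]), 0.0),
    }


# Setting the hole ranges to zero suppresses the inner hole ("perfect spots").
rng = random.Random(3)
perfect_block = sample_block(10.0, 12.0, 0.05, (0.0, 0.0), (0.0, 0.0), rng)
perfect_spots = [sample_spot(perfect_block, rng) for _ in range(50)]
```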

3. Examples of Simulated Microarrays and Image Analysis

All of the described process and noise effects are controlled by appropriate parameter selection. Depending on the parameter setting, the arrays can be roughly classified as ideal, average, or noisy. Given a good printing run (no mechanical deposition problems), a relatively mature hybridization protocol, and good RNA samples, along with a scanner with minimal optical warping, focusing, and integration problems, we expect a high-quality (ideal) microarray image. The corresponding simulated ideal image will have a flat mean background with typical autofluorescence variation (<10% of the mean background level, but no less than the square root of the mean background level), minimum spike/scratch/snake noise, little edge enhancement, and no channel conditioning problems. For average image quality, one would expect larger background variation and possibly a slanted mean level. There will also be more spike/scratch/snake noise interfering with signal spots. In a noisy setting, besides higher noise levels for the various possible interferences, one would also expect an uneven background level (e.g., a parabolic function), heavy spot deformity (chord cuts, edge enhancement, and large inner holes), and different channel conditioning [such as the banana shape in the intensity scatter plot shown in Figure 11(b)].
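The flat, slanted, and parabolic backgrounds mentioned here correspond to the Table 1 model I_b∼N(μ_b,σ_b ²) with μ_b=γg(x,y) and σ_b=k_bμ_b. The sketch below is illustrative only: the shape functions in `surface` are hypothetical stand-ins (the actual g(x,y) is not specified in this section), k_b is taken as a fraction, and the image must be at least two pixels wide and high.

```python
import random


def surface(kind, u, v):
    """Hypothetical shape functions g on normalized coordinates u, v in [0, 1]."""
    if kind == "flat":
        return 1.0
    if kind == "slant":
        return 1.0 + 0.5 * (u + v)
    if kind == "parabolic":
        return 1.0 + (u - 0.5) ** 2 + (v - 0.5) ** 2
    raise ValueError(f"unknown surface: {kind}")


def make_background(width, height, gamma, k_b, kind="flat", rng=None):
    """Return a height x width background with mean gamma*g(x, y) and sd k_b*mean."""
    rng = rng or random.Random()
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            mu_b = gamma * surface(kind, x / (width - 1), y / (height - 1))
            row.append(max(rng.gauss(mu_b, k_b * mu_b), 0.0))
        image.append(row)
    return image
```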

Figure 17 shows two microarrays generated with NS_w=35 rows and NS_h=25 columns, at B_h=B_w=900 pixels per block. Array boundaries are set at (M_t,M_l,M_r,M_b)=(100,100,100,100). By choosing parameters, two different array qualities have been generated. Part (a) illustrates an ideal microarray image with normal background and parameters β=3000, SNR=2.0, α=0.05, G_sp=6, P_D=0.05, (d_a,d_b)=(2,15), (k_b₁,k_b₂)=(10,10), P_outlier=0.05, L_spi=0.3, δ_ed=0.3:

(f_{a_{1}}, f_{b_{1}}, f_{c_{1}}, f_{d_{1}}) = (2, 8, 2, 6),

(f_{a_{2}}, f_{b_{2}}, f_{c_{2}}, f_{d_{2}}) = (2, 8, 2, 8),

(a_{0}, a_{1}, a_{2}, a_{3}) = (0, 1, - 1, 1),

(b_{0}, b_{1}, b_{2}, b_{3}) = (0, 1, - 1, 1),

(l_{a}, l_{b}, N_{e}) = (1, 3, 3),

(p_{0}, p_{1}, p_{2}, p_{3}, p_{4}) = (0.97, 0.03, 0, 0, 0),

(K_{SN}, L_{SN 1}, L_{SN 2}, W_{SN}, N_{SN}) = (0.25, 10, 50, 1, 2),

(K_{SC}, L_{SC 1}, L_{SC 2}, W_{SC}, N_{SC}) = (3, 5, 35, 3, 1) .

Figure 17

Full-size array simulations with different parameter settings: (a) good-quality array with SNR of 2.0, normal background, and spike noise L_spi=0.3; (b) noisy array with SNR of 1.1, parabolic background noise, and spike noise L_spi=15.

Part (b) illustrates a noisy microarray image with parabolic background and parameters β=3000, SNR=1.1, α=0.25, G_sp=4, P_D=0.4, (d_a,d_b)=(15,100), (k_b₁,k_b₂)=(25,25), P_outlier=0.7, L_spi=15, δ_ed=0.03:

(f_{a_{1}}, f_{b_{1}}, f_{c_{1}}, f_{d_{1}}) = (6, 12, 8, 20),

(f_{a_{2}}, f_{b_{2}}, f_{c_{2}}, f_{d_{2}}) = (6, 12, 8, 20),

(a_{0}, a_{1}, a_{2}, a_{3}) = (0, 500, - 1, 1),

(b_{0}, b_{1}, b_{2}, b_{3}) = (0, 10, - 1, 1),

(l_{a}, l_{b}, N_{e}) = (10, 40, 3),

(p_{0}, p_{1}, p_{2}, p_{3}, p_{4}) = (0.05, 0.3, 0.25, 0.25, 0.15),

(K_{SN}, L_{SN 1}, L_{SN 2}, W_{SN}, N_{SN}) = (0.25, 60, 110, 2, 10),

(K_{SC}, L_{SC 1}, L_{SC 2}, W_{SC}, N_{SC}) = (0.25, 60, 110, 2, 10) .

To illustrate how the simulation can be used to analyze microarray image software, we apply the ArraySuite¹¹ software to extract the image intensities and ratios from the image and then compare these to the corresponding intensities and ratios used for simulation. We use the ideal case to illustrate the utility of the simulation. In Figure 18(a), intensities from one fluorescent channel have been extracted (y axis) and plotted against the simulation signal intensities (x axis). The extracted signal generally corresponds well to the simulated signal, with some variation. After excluding intensities less than 300, the mean and standard deviation of the difference between the two log₁₀-transformed intensities are 0.016 (or 10^0.016=1.038) and 0.038 (or 10^0.038=1.09), respectively. The ratio comparison is given in Figure 18(b). When signal intensity is weak (less than 300), various noise components in the simulation process affect the accuracy of the signal extraction program. Since the problem is unavoidable, a measurement quality metric is necessary to provide confidence in downstream data analysis. In this case, we see that if the signal intensity is less than 300, then the noise interaction is significant.
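The comparison statistic used here can be computed with a few lines of Python. The function name `log_difference_stats` is hypothetical, and two details are assumptions: a pair is excluded when either intensity falls below the floor of 300, and the sample (n−1) standard deviation is used.

```python
import math
import statistics


def log_difference_stats(extracted, simulated, floor=300.0):
    """Mean and standard deviation of log10(extracted) - log10(simulated)."""
    diffs = [
        math.log10(e) - math.log10(s)
        for e, s in zip(extracted, simulated)
        if e >= floor and s >= floor  # drop weak signals dominated by noise
    ]
    return statistics.mean(diffs), statistics.stdev(diffs)


# A log10 difference d corresponds to a multiplicative factor of 10**d.
mean_factor = 10 ** 0.016    # about 1.038, i.e. roughly a 3.8% bias
spread_factor = 10 ** 0.038  # about 1.09
```

Expressed as factors, the reported values say that the extracted intensities are on average about 4% higher than the simulated ones, with a spread of roughly 9%.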

Figure 18

Comparison between the simulated signal (ideal setting) and the signal extracted by the microarray image analysis program. (a) Signal extracted from one fluorescent channel (y axis) compared with the signal used for simulation in the same channel (x axis). (b) Ratios from the microarray image analysis program (y axis) compared with the ratios generated by the simulation (x axis).

4. Conclusion

Modeling and simulation of microarray image formation is a key to benchmarking various signal processing tools being developed to estimate cDNA signal spots. Using a model to describe the signal ground truth not only helps in evaluating these tools, but also facilitates the understanding of various process interactions. To illustrate how the image-simulation program presented in this paper can be used in the development of image-analysis software, we describe an actual case.

The simulation program has been used extensively in the design of the microarray image-analysis program used at the National Human Genome Research Institute. This has been done by testing the accuracy of the analysis program on simulated images exhibiting troublesome noise conditions and then tuning the program to achieve better results. One such application concerns large and overlapping spots, as illustrated in Figure 19(a), which shows part of an actual hybridized image in which some spots are substantially larger than intended owing to randomness in the cDNA deposition procedure. This defect causes various problems, one being poor background estimation. We illustrate this problem by simulating an image with large spot size variation and drifting conditions [Figure 19(b)]. If the image analysis program extracts the local background by averaging the region around the bounding box (which was used as a starting condition in an earlier version of the NHGRI program), an elevated background average may be obtained since the bounding box may overlap neighboring targets that are large in size and strong in expression level. An additional problem is that some weak targets may not be detected [Figure 19(c)]. Based on these considerations, the program has been modified to calculate the four average intensities from the four corners and the four average intensities from the four sides of the bounding box, and then take the minimum among all of these as the initial estimation of the local background. A histogram-based method is then invoked around the initial estimated background to further improve the estimation. The output from Figure 19(b) according to the modified program is shown in Figure 19(d): the weak target is detected and there is improved local background estimation for all spots.
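The modified initial background estimate (minimum over four corner averages and four side averages of the bounding box) can be sketched as follows. This is not the NHGRI code: the function name `initial_background`, the patch thickness `m`, and the exact extent of the corner and side regions are assumptions, the box must be wider and taller than 2m pixels, and the subsequent histogram-based refinement is not shown.

```python
def initial_background(image, x0, y0, x1, y1, m=2):
    """Minimum of 4 corner and 4 side averages on the box x0..x1, y0..y1 (inclusive)."""

    def patch_mean(xa, xb, ya, yb):
        values = [image[y][x] for y in range(ya, yb + 1) for x in range(xa, xb + 1)]
        return sum(values) / len(values)

    corners = [
        patch_mean(x0, x0 + m - 1, y0, y0 + m - 1),  # top-left
        patch_mean(x1 - m + 1, x1, y0, y0 + m - 1),  # top-right
        patch_mean(x0, x0 + m - 1, y1 - m + 1, y1),  # bottom-left
        patch_mean(x1 - m + 1, x1, y1 - m + 1, y1),  # bottom-right
    ]
    sides = [
        patch_mean(x0 + m, x1 - m, y0, y0 + m - 1),  # top
        patch_mean(x0 + m, x1 - m, y1 - m + 1, y1),  # bottom
        patch_mean(x0, x0 + m - 1, y0 + m, y1 - m),  # left
        patch_mean(x1 - m + 1, x1, y0 + m, y1 - m),  # right
    ]
    return min(corners + sides)


# A bright neighboring spot overlapping the right edge of the bounding box
# inflates a plain border average but not the minimum-based estimate.
box = [[50.0] * 12 for _ in range(12)]
for row in box:
    row[10] = row[11] = 1000.0
estimate = initial_background(box, 0, 0, 11, 11)
```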

Figure 19

(a) Part of actual hybridized image with spots larger than average; (b) simulated microarray with larger spots and spots overlapping with their neighbors; (c) original background intensity extraction program produces undetected spot (target in the middle without outer boundary); (d) improved background extraction program more accurately measures the local background intensity and effectively allows detection of weak targets.

REFERENCES

1. M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science 270, 467–470 (1995).

2. J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen, Y. A. Su, and J. M. Trent, “Use of a cDNA microarray to analyse gene expression patterns in human cancer,” Nat. Genet. 14(4), 457–460 (1996).

3. P. T. Spellman et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Mol. Biol. Cell 9(12), 3273–3297 (1998).

4. J. Khan, R. Simon et al., “Gene expression profiling of alveolar rhabdomyosarcoma with cDNA microarrays,” Cancer Res. 58(22), 5009–5013 (1998).

5. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science 286(5439), 531–537 (1999).

6. V. R. Iyer et al., “The transcriptional program in the response of human fibroblasts to serum,” Science 283(5398), 83–87 (1999).

7. M. Bittner, P. Meltzer et al., “Molecular classification of cutaneous malignant melanoma by gene expression profiling,” Nature (London) 406(6795), 536–540 (2000).

8. A. A. Alizadeh et al., “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature (London) 403(6769), 503–511 (2000).

9. I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, M. Raffeld, Y. Yakhini, A. Ben-Dor, E. Dougherty, J. Kononen, L. Bubendorf, W. Fehrle, S. Pittaluga, S. Gruvberger, N. Loman, O. Johannsson, H. Olsson, B. Wilfond, G. Sauter, O. Kallioniemi, A. Borg, and J. Trent, “Gene-expression profiles in hereditary breast cancer,” N. Engl. J. Med. 344(8), 539–548 (2001).

10. J. Khan, J. S. Wei, M. Ringnér, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer, “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nat. Med. (N.Y.) 7(6), 673–679 (2001).

11. Y. Chen, E. R. Dougherty, and M. Bittner, “Ratio-based decisions and the quantitative analysis of cDNA microarray images,” J. Biomed. Opt. 2(4), 364–374 (1997).

12. P. Kalocsai and S. Shams, “Use of bioinformatics in arrays,” Methods Mol. Biol. 170, 223–236 (2001).

13. See www.imgresearch.com, genome-www.stanford.edu/microarray, www.axon.com, www.imagingresearch.com, and www.nutecsciences.com.

14. D. J. Duggan, M. L. Bittner, Y. Chen, P. S. Meltzer, and J. M. Trent, “Expression profiling using cDNA microarrays,” Nat. Genet. 21(1), 10–14 (1999).

15. M. K. Kerr and G. A. Churchill, “Statistical design and the analysis of gene expression microarray data,” Genet. Res. 77(2), 123–128 (2001).

16. F. W. D. Rost, Fluorescence Microscopy, Cambridge University Press, Cambridge (1995).

17. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl. Acad. Sci. U.S.A. 95, 14863–14868 (1998).

18. A. Ben-Dor, R. Shamir, and Z. Yakhini, “Clustering gene expression patterns,” J. Comput. Biol. 6(3/4), 281–297 (1999).

19. P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub, “Interpreting pattern of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation,” Proc. Natl. Acad. Sci. U.S.A. 96(6), 2907–2912 (1999).

20. S. Kim, E. R. Dougherty, M. L. Bittner, Y. Chen, K. Sivakumar, P. Meltzer, and J. M. Trent, “A general framework for the analysis of multivariate gene interaction via expression arrays,” J. Biomed. Opt. 5(4), 411–424 (2000).

21. S. Kim, E. R. Dougherty, Y. Chen, K. Sivakumar, P. Meltzer, J. M. Trent, and M. Bittner, “Multivariate measurement of gene-expression relationships,” Genomics 67, 201–209 (2000).

22. D. Stoyan, W. S. Kendall, and J. Mecke, Stochastic Geometry and Its Applications, Wiley, Chichester (1995).

23. Advances in Theory and Applications of Random Sets, D. Jeulin, Ed., World Scientific, New York (1997).

24. E. R. Dougherty, Random Processes for Image and Signal Processing, SPIE, Bellingham, WA (1999).

25. R. P. Loce and E. R. Dougherty, Enhancement and Restoration of Digital Documents, SPIE, Bellingham, WA (1997).

26. J. O. Bishop, J. G. Morton et al., “Three abundance classes in Hela cell messenger RNA,” Nature (London) 250(463), 199–240 (1974).

27. http://arrayanalysis.nih.gov/resources/pub_download/jbo3_supplement.htm
