Abstract

Missing values are prevalent in microarray data, they course negative influence on downstream microarray analyses, and thus they should be estimated from known values. We propose a BPCA-iLLS method, which is an integration of two commonly used missing value estimation methods—Bayesian principal component analysis (BPCA) and local least squares (LLS). The inferior row-average procedure in LLS is replaced with BPCA, and the least squares method is put into an iterative framework. Comparative result shows that the proposed method has obtained the highest estimation accuracy across all missing rates on different types of testing datasets.

1. Introduction

Data generated from DNA microarray data is useful for various biological applications; the data is in the form a large matrices. Generally, a row in a matrix represents a gene, and a column represents an experimental condition. But as large matrices, the data often suffer from missing values due to technical reasons such as spotting problems and background noise [1]. However, downstream analyses always need full matrices as input; thus these missing values should be estimated from existing values. Various methods to estimate missing values in microarray data have been proposed in the past decades. Generally, methods to estimate missing values can be divided into four categories [2]: (i) global based methods, (ii) local based methods, (iii) hybrid methods, and (iv) knowledge-based methods. Singular value decomposition (SVD) [3] and Bayesian principal component analysis (BPCA) [4] are two major global based approaches. SVD estimates the missing value in gene by first regressing this gene against eigengenes and use the coefficients of the regression to reconstruct from a linear combination of the eigengenes. BPCA estimates the target gene (i.e., a gene that contains missing values) by a linear combination of principal axis vectors, where the parameters are identified by a Bayesian estimation method. Local based category includes some classical and newly proposed methods. The most well-studied local based method is local least squares (LLS) [5]. LLS uses a multiple regression model to estimate the missing values from nearest neighbor genes of the target gene. Most recently proposed local methods are based on LLS, including iterated Local Least Squares (iLLS), weighted local least squares (wLLS) and iterative bicluster-based least squares (bi-iLS). Hybrid methods aim to capture both global and local correlations in the data. LinCmb [6] and EMDI [7] are two typical hybrid methods which estimate the missing values by a combination of other estimation methods from global approaches and local approaches. In the knowledge-based category, domain biological knowledge or external information is integrated into the estimation process.

Among all kinds of microarray missing value estimation methods, BPCA and local least squares (LLS) are two most widely used approaches. The former is based on the global structure of the matrix, and the latter is based on local similarity of the matrix. According to a survey [8] about different microarray missing value estimation methods, BPCA performs better than LLS on datasets with lower complexity, whereas due to another survey [9], LLS is superior than BPCA in the presence of data with dominant local similarity structures. This phenomenon inspires us to integrate the two methods, with the hope of improving the estimation accuracy and robustness. The idea of iterated local least squares again inspired us to put the integrated method into an iterative framework, which will further improve the estimation accuracy. We will give a brief review of BPCA and LLS in Section 2, the new method will be described in Section 3, comparative test of the proposed method with LLS and BPCA will be given Section 4, and a conclusion is drawn in Section 5.

2. Brief Review of BPCA and LLS

2.1. Bayesian Principal Component Analysis

Bayesian methods have been widely used in many fields such as face recognition and decision making [1013], and it also has successful application in microarray missing value estimation. Bayesian principal component analysis (BPCA) represents the -dimensional microarray expression vectors as a linear combination of () principal axis vectors (): where the coefficient is called a factor score and denotes the residual error. The principal axis vectors are obtained by computing the eigenvalues and eigenvectors of the covariance matrix of the dataset . As there are missing values in the original matrix , the principal axis vectors are separated into two parts as , corresponding to the observed part and missing part, respectively. Factor scores are obtained by minimizing the residual error of the observed part: Equation (2) is a least squares problem which can be solved easily in BPCA. By using the factor scores and , the missing part of the dataset is estimated as

In BPCA, the factor scores and the residual error in (1) are assumed to obey normal distributions; BPCA utilizes a probabilistic PCA (PPCA) model [14] to estimate parameters in the normal distribution. The parameter , along with another two parameters and in the normal distribution, forms a parameter set . BPCA introduces a Bayesian estimation method for the PPCA model, where the posterior distributions of and are estimated by a variational Bayes algorithm [15] simultaneously.

2.2. Local Least Squares

Local least squares (LLS) uses the linear correlation of the target gene and its nearest neighbors to recover unknown entries in the target gene. To explain how LLS works, we take an microarray matrix as an example. Assuming that gene has missing values, take and its nearest neighbors as a column vector, where in finding the nearest neighbors, the measurement can be -norm distance or Pearson’s correlation; then, rewrite the vector as (4):

In (4), is the vector of unknown entries of the target gene and is the vector of known entries of the target gene. and are the neighbors’ corresponding columns with and , respectively. A linear coefficient vector is established as a least squares problem with and :

Then the unknown entries of the target gene can be reconstructed by a linear combination of and : where is the pseudoinverse of . Repeat the procedure for all rows that have missing values and the full matrix can be recovered.

To estimate a proper value in finding nearest neighbors, LLS [5] provides a method like this. First, erase a certain number of known entries as missing values. Then, estimate the artificial missing matrix by using different neighbors by LLS. At last, compare these estimated matrices with the actual matrix; the value corresponding to the highest accuracy is chosen to be the optimal parameter.

3. BPCA-iLLS

Note that in LLS, in order to find nearest neighbors and to estimate an optimal value, a complete matrix is needed. However, in many cases, almost all rows in a microarray matrix contain missing values, which makes the distances between the target gene and other genes unable to be measured. To solve this problem, LLS [5] fills all missing values in the target gene by the row’s average value first. But in our experiment, we found that row-average cannot reflect the real structure of the dataset. Because row-average only uses the information of an individual row, the missing values in a target gene do not only rely on the known values in its own row. In the proposed BPCA-iLLS method, we replace the row-average procedure in LLS with BPCA. The flowchart of the proposed method is shown in Figure 1.

First, the input incomplete matrix is estimated by BPCA, to get a complete matrix. Next, this complete matrix is used as a temporary matrix for a further LLS procedure. In the LLS procedure, the optimal value is estimated on this temporary matrix, and this value is used to find matrices and . Subsequently, the missing values in every target gene are estimated by matrix and the coefficient vector . LLS is put into an iterative framework in the proposed method; that is, the estimated values by LLS are reused to form the temporary matrix in every iteration, and matrices and are refined in every iteration. It can be seen from the flowchart that the temporary matrices are different in each iteration. The initial temporary matrix is estimated by BPCA; following that, this matrix turns into the complete matrix that is estimated by LLS in each iteration. It should be mentioned that if the number of complete rows in the original incomplete matrix exceeds a preset threshold (e.g., 400 in LLS [5]), only complete rows are used to form the initial temporary matrix, which will highlight the original information of the matrix. This phenomenon happens only when the missing rates are low (typically below 5%). In most cases, the initial temporary matrices are BPCA-estimated ones in our proposed method. By replacing the row-average procedure in LLS by BPCA, and refining the temporary matrix in each iteration, the proposed method has the advantage over LLS and BPCA to be more robust on all kinds of datasets and has the ability to reduce the estimation error.

4. Comparative Result

4.1. Methods and Evaluation

We compare the proposed BPCA-iLLS method with BPCA and LLS. The only parameter of BPCA (number of principal axis vectors) is set to its default value, and the only parameter of LLS (number of neighbor genes) is learned by its heuristic method. For the proposed method, the number of iterations is a new parameter, and in our experiments, we set this parameter to be 5 because the estimation results do not change much after 5 iterations.

The accuracy is evaluated by normalized root mean square error (NRMSE): where is the real value, is the estimated value, and is the standard deviation for the actual values of the missing entries. A smaller NRMSE represents a higher accuracy. The same evaluation criterion was also used in LLS, BPCA, and a survey of different missing value estimation methods [9].

4.2. Datasets

Three types of datasets are tested for the proposed method, they are time series data (TS), non-time-series data (NTS), and mixed data (MIX). Table 1 shows details of the testing datasets. Here, CDC15_28 is the same time series data as what was used in survey [9]; SP_ALPHA was also used in [5] to test the performance of LLS. NCI60 and Yoshi come from the non-time-series data and mixed data in survey [9], respectively.

All original datasets contain missing values. To compute the estimation error rates, only complete rows of these datasets are used. A number of entries are randomly removed from the complete part to get artificial missing values in different missing rates. As the real values of these entries are actually known, the error rates can be calculated following (7). The same testing method was also employed in BPCA, LLS, and surveys [2, 8, 9].

4.3. Experimental Result

We estimate different rates of simulated missing values on the abovementioned datasets by three comparative methods: LLS, BPCA, and BPCA-iLLS, and calculate NRMSE following (7). Figures 2(a), 2(b), 2(c), and 2(d) provide the NRMSE across different missing rates for the three comparative methods on datasets CDC15_28, SP_ALPHA, NCI60, and Yoshi, respectively. Every NRMSE is a mean value of five independent experiments.

It can be seen from Figure 2 that on all the four testing datasets, BPCA-iLLS obtains the lowest NRMSE across all missing rates. LLS outperforms BPCA on datasets CDC15_28 and NCI60, and BPCA outperforms LLS on dataset SP_ALPHA; this reveals that the two methods are complementary with each other. As an integration of the two methods, BPCA-iLLS shows its robustness on different datasets.

Table 2 shows the computational time of different methods on dataset CDC15_28. The time is obtained from running experiments by Matlab R2011b on an ordinary 64 bit Windows 7 computer with 3.4 GHz quad-core processor and 16 GB internal memory. Intuitively, as an integration of two methods, BPCA-iLLS requires more computational time. It can be seen from Table 2 that the computational time of BPCA-iLLS is indeed longer than that of BPCA and LLS. However the increment of time is within a limited scope. Considering its estimation accuracy, the increment of computational time is acceptable.

5. Conclusion

Microarray missing value estimation is an important procedure in biology experiments. As two widely used missing value estimation methods, Bayesian principal component analysis (BPCA) and local least squares (LLS) take advantage of the matrix’s global structure and local structure, respectively; these two methods are complementary with each other. The proposed BPCA-iLLS method is an integration of BPCA and LLS, which fully exploits the global structure and local structure of the microarray matrix simultaneously, and the iterative scheme also helps to reduce the estimation error. Experimental results show that BPCA-iLLS has obtained the lowest normalized root mean square error (NRMSE) across all missing rates on all the testing datasets within an acceptable computational time. The performance of BPCA-iLLS also reveals the effectiveness of the integration of both global and local correlations of the microarray data, and such integration is one possible future direction of this field.