
METHODS article

Front. Genet., 23 February 2021
Sec. Computational Genomics
This article is part of the Research Topic Computational Learning Models and Methods Driven by Omics for Biology for "The Fifth China Computer Society Bioinformatics Conference".

Joint Lp-Norm and L2,1-Norm Constrained Graph Laplacian PCA for Robust Tumor Sample Clustering and Gene Network Module Discovery

  • School of Computer Science, Qufu Normal University, Rizhao, China

The dimensionality reduction method accompanied by different norm constraints plays an important role in mining useful information from large-scale gene expression data. In this article, a novel method named Lp-norm and L2,1-norm constrained graph Laplacian principal component analysis (PL21GPCA), based on traditional principal component analysis (PCA), is proposed for robust tumor sample clustering and gene network module discovery. Three aspects are highlighted in the PL21GPCA method. First, to reduce the high sensitivity to outliers and noise, the non-convex proximal Lp-norm (0 < p < 1) constraint is applied on the loss function. Second, to enhance the sparsity of gene expression in cancer samples, the L2,1-norm constraint is used on one of the regularization terms. Third, to retain the geometric structure of the data, we introduce the graph Laplacian regularization term into the PL21GPCA optimization model. Extensive experiments on five gene expression datasets, including one benchmark dataset, two single-cancer datasets from The Cancer Genome Atlas (TCGA), and two integrated datasets of multiple cancers from TCGA, are performed to validate the effectiveness of our method. The experimental results demonstrate that the PL21GPCA method performs better than many other methods in terms of tumor sample clustering. Additionally, this method is used to discover gene network modules for the purpose of finding key genes that may be associated with some cancers.

Introduction

High-throughput sequencing technologies, including genome-wide measurements, have enabled large-scale gene expression profiles to accumulate rapidly (Goodwin et al., 2016). It is of great significance to obtain useful information from these data. Reliable and precise identification of cancer types and of key pathogenic genes is very important for cancer diagnosis and treatment (Koboldt et al., 2012). Generally, gene expression data have the typical characteristic of "high dimension, low sample size" (West, 2003), which is a challenge for most traditional statistical methods. The large number of variables and the presence of uncorrelated noise variables in gene expression data can negatively affect tumor clustering performance, regardless of whether supervised or unsupervised clustering methods are used. Despite these problems, many researchers have demonstrated the effectiveness of tumor-type identification and feature selection by leveraging machine learning algorithms (Hochreiter et al., 2010; Lee et al., 2010; Liu J. X. et al., 2015; Bunte et al., 2016; Kong et al., 2017; Wang et al., 2017; Chen et al., 2019). Among them, algorithms based on principal component analysis (PCA) (Collins, 2002; Jolliffe, 2002) have been widely and successfully used to process gene expression data (Liu et al., 2013; Liu J. X. et al., 2015; Wang et al., 2017; Feng et al., 2019) for dimension reduction and denoising. However, PCA-based algorithms, including sparse principal component analysis (SPCA) (Zou et al., 2006; Shen and Huang, 2008; Journee et al., 2010; Liu et al., 2016; Feng et al., 2019) and robust principal component analysis (RPCA) (Candès et al., 2009; Liu et al., 2013; Liu J. X. et al., 2015; Wang et al., 2017), mainly deal with data that lie on a linear manifold (Jiang et al., 2013). Many methods that can handle data lying on a non-linear manifold have been proposed, such as locality preserving projections (LPP) (He et al., 2005), locally linear embedding (LLE) (Roweis and Saul, 2000), local tangent space alignment (Zhang and Zha, 2002), the Laplacian eigenmap (LE) (Belkin and Niyogi, 2002, 2003; Spielman, 2007) and the Laplacian eigenmaps latent variable model (LELVM) (Keyhanian and Nasersharif, 2015). The purpose of these non-linear dimensionality reduction techniques is to find a representation of points (samples) in a low-dimensional space in which all points (samples) still maintain their similarity from the original high-dimensional space.

In recent years, optimization models that combine linear and non-linear dimensionality reduction methods, especially graph Laplacian embedding, have demonstrated their effectiveness. Liu et al. (2017) constructed a graph Laplacian matrix for semisupervised feature extraction. Cai et al. (2011) proposed a method named graph regularized non-negative matrix factorization (GNMF), which combined graph structure and non-negative matrix factorization for an improved compact representation of the original data. Jiang et al. (2013) developed graph-Laplacian PCA (gLPCA), which sought a low-dimensional representation of image data with significant improvement in clustering and image reconstruction by incorporating graph structures and PCA. Feng et al. (2017) employed pgLPCA, based on graph Laplacian regularization and the Lp-norm, for feature selection and tumor clustering. Wang et al. (2019a) used Laplacian regularized low-rank representation (LLRR), which considers the intrinsic geometric structure of gene expression data, to cluster tumor samples. In addition, many methods benefit from norm constraints. For example, Journee et al. (2010) employed the L0-norm constraint based on PCA to stress the sparse expression of genes in samples. The L1-norm (Tibshirani, 1996) was introduced as the regularization function in sparse singular value decomposition (SSVD) (Lee et al., 2010; Kong et al., 2017) and in the mixed-norm optimization model proposed by Wang et al. (2019b). Feng et al. (2016) employed the L1/2-norm constraint in their model to select characteristic genes. However, there remain some facets to be improved: for example, the robustness of the algorithm should be enhanced further, and the sparse representation of the data should be highlighted. For these purposes, the Lp-norm (Chartrand, 2012; Nie et al., 2013; Feng et al., 2017; Kong et al., 2017) constraint was used in the optimization model to reduce the sensitivity to outliers in the data, and the L2,1-norm (Xiang et al., 2012; Yang et al., 2012) constraint was used by Liu et al. (2017) and Wang et al. (2019b) to induce row sparsity.

Motivated by the literature mentioned above, especially (Tibshirani, 1996; Chartrand, 2012; Xiang et al., 2012; Nie et al., 2013; Feng et al., 2017; Kong et al., 2017), we propose a new method named PL21GPCA incorporating traditional PCA, graph Laplacian embedding and different norm constraints on the loss function and the regularization function for robust tumor sample clustering and gene network module discovery. Five gene expression datasets, including one benchmark dataset, two single-cancer datasets from The Cancer Genome Atlas (TCGA), and two integrated datasets of multiple cancers from TCGA, are used to evaluate the effectiveness of our method. The experimental results demonstrate that the PL21GPCA method outperforms many existing methods in terms of tumor sample clustering. Additionally, this method is employed to discover gene network modules to identify the key genes with close relationships to some cancers.

We organize the rest of this paper as follows. Section “Related Works” provides the related works containing the non-convex proximal Lp-norm, L2,1-norm and graph regularized PCA. The optimization model of PL21GPCA is presented, and the solution procedure is detailed in section “Methodology.” Section “Experiments and Discussion” presents the parameter selections, experimental results and some discussions. The tumor sample clustering and gene network analysis are also described in this section. In Section “Conclusion and Suggestions,” we present the conclusion for this article and propose some suggestions for future research.

Related Works

Definitions of the Proximal Lp-Norm and L2,1-Norm

Let X ∈ R^{p×n} be a data matrix. The proximal Lp-norm of X is defined as follows:

$$\|X\|_{p}=\left(\sum_{i=1}^{p}\sum_{j=1}^{n}|x_{ij}|^{p}\right)^{\frac{1}{p}},\quad 0<p<1 \tag{1}$$

The Lp-norm with 0 < p < 1 is a function with three typical characteristics: globally non-differentiable, non-convex, and non-smooth (Chartrand, 2012; Zhang et al., 2015). Many researchers have made suggestions to deal with Lp-norm (0 < p < 1) minimization (Chartrand, 2012; Guo et al., 2013; Qin et al., 2013). Since Lp-norm minimization can result in a sparser solution than the L1-norm and perform better in terms of robustness to outliers than the L2-norm in a sense, we use it to constrain the loss function of the PL21GPCA optimization model. The generalized shrinkage operation proposed by Chartrand (2012) is adopted to solve the function effectively in our method.

The L2,1-norm of matrix X is as follows:

$$\|X\|_{2,1}=\sum_{i=1}^{p}\sqrt{\sum_{j=1}^{n}x_{ij}^{2}}=\sum_{i=1}^{p}\|x^{i}\|_{2} \tag{2}$$

where x^i (corresponding to feature i) is the ith row of matrix X. Yang et al. (2012) provided an intuitive explanation of the L2,1-norm. To compute the L2,1-norm, we first take the L2-norm of each row of X, record the results as a vector b(X) = (∥x^1∥_2, ∥x^2∥_2, …, ∥x^p∥_2), and then compute the L1-norm of the vector b(X). The components of b indicate the importance of each feature. The L2,1-norm favors a small number of non-zero rows in matrix X, through which feature selection is achieved.
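As a small illustration of this two-step computation, a minimal NumPy sketch of Eq. (2) follows; the helper name is ours, not from the paper.

```python
import numpy as np

def l21_norm(X):
    """L2,1-norm of Eq. (2): the L2-norm of each row of X forms the
    vector b(X), whose L1-norm is the result."""
    b = np.sqrt((X ** 2).sum(axis=1))  # b(X) = (||x^1||_2, ..., ||x^p||_2)
    return b.sum()                     # ||b(X)||_1
```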

PCA and Graph Laplacian Embedding

Principal Component Analysis (PCA)

Let X = (x_1,⋯,x_n) ∈ R^{p×n} (p ≫ n) be a matrix whose rows represent genes and whose columns represent samples. PCA is usually used to find the optimal principal directions V^T = (v_1,⋯,v_n) ∈ R^{k×n} (V^TV = I) that define the low-dimensional (k-dim) subspace, and the projected data points in the low-dimensional subspace can be denoted as the elements of the matrix U_{p×k} = (u_1,⋯,u_k) ∈ R^{p×k}. The traditional PCA finds U and V with the squared Frobenius norm:

$$\arg\min_{U,V}\|X-UV^{T}\|_{F}^{2}\quad s.t.\quad V^{T}V=I \tag{3}$$

In our optimization model, the proximal Lp-norm ∥·∥_p (0 < p < 1) (Chartrand, 2012; Nie et al., 2013; Feng et al., 2017) is used instead of the traditional quadratic loss function ∥·∥_F to reduce the influence of outliers and noise. PCA is naturally and closely related to the classic clustering method known as K-means (Ding and He, 2004): the optimal principal components contained in matrix V provide the solution of the K-means clustering method. This inspired us to combine PCA with Laplacian embedding, whose principal purpose is also clustering.
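To make the factorization behind Eq. (3) concrete, the following sketch computes the rank-k PCA factors via a truncated SVD and then clusters tumor samples on the rows of V; the function name, the use of scikit-learn's KMeans, and the centering convention are our assumptions, not details from the paper.

```python
import numpy as np

def pca_factors(X, k):
    """Rank-k PCA factorization X ~ U V^T (Eq. 3) via truncated SVD.
    X is p x n (genes x samples); centering, if desired, is assumed
    to be done by the caller."""
    Uf, s, Vt = np.linalg.svd(X, full_matrices=False)
    U = Uf[:, :k] * s[:k]   # projected data, p x k
    V = Vt[:k, :].T         # principal directions, n x k, V^T V = I
    return U, V

# Tumor samples can then be clustered on the rows of V, e.g.:
# from sklearn.cluster import KMeans
# labels = KMeans(n_clusters=4, n_init=10).fit_predict(V)
```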

Graph Laplacian Embedding

Principal component analysis can find an approximate set of basis vectors in the case where the data lie on a linear manifold (Jiang et al., 2013). In consideration of the local invariance of the intrinsic geometric structure of the data distribution, graph Laplacian embedding is a popular method among recent studies in non-linear manifold learning theory (Belkin and Niyogi, 2002, 2003; Spielman, 2007). The assumption of local invariance is that if two points (samples) are close in the intrinsic geometry of the original data distribution, the representations of these two points (samples) in the new coordinates are also close to each other. The local geometric structure can be modeled through a nearest neighbor graph on a scatter of data points. Given the data matrix X = (x_1,⋯,x_n) ∈ R^{p×n}, each x_i (i = 1,⋯,n) can be regarded as one data point (one vertex in the graph). For each data point x_i, we find its k′ nearest neighbors and put edges between x_i and its neighbors. Then, a graph with n vertices can be constructed, on which the weight matrix W ∈ R^{n×n} is defined. w_ij is the weight between vertices x_i and x_j; it measures the closeness of the two points, and W is a symmetric similarity matrix. There are three popular choices for defining the weight matrix on the graph: heat kernel weighting, 0–1 weighting, and dot-product weighting. If nodes i and j are connected, then w_ij = e^{−∥x_i − x_j∥²/σ} under heat kernel weighting, w_ij = 1 under 0–1 weighting, and w_ij = x_i^T x_j under dot-product weighting. The different similarity measures are suitable for different situations. Detailed information about the different weighting schemes can be found in the literature (Cai et al., 2011).

Let Z^T = (z_1,z_2,⋯,z_n) ∈ R^{k×n} represent the n data points in the k-dim embedding coordinates, i.e., the representation of x_i in the new low-dimensional basis is z_i = [v_{i1},⋯,v_{ik}], where V^T = (v_1,⋯,v_n) ∈ R^{k×n} with V^TV = I. The "dissimilarity" of two data points in the low-dimensional basis can be measured by the Euclidean distance or the divergence distance; the Euclidean distance is adopted in our method. Define the "dissimilarity" of two points in the low-dimensional basis as d(z_i,z_j) = ∥z_i − z_j∥². Combined with the weight matrix W, the smoothness of the low-dimensional representation can be measured by minimizing:

$$S=\frac{1}{2}\sum_{i,j=1}^{n}\|z_{i}-z_{j}\|^{2}w_{ij}=\sum_{i=1}^{n}z_{i}^{T}z_{i}D_{ii}-\sum_{i,j=1}^{n}z_{i}^{T}z_{j}w_{ij}=Tr(V^{T}DV)-Tr(V^{T}WV)=Tr(V^{T}LV) \tag{4}$$

where Tr(·) is the trace of a matrix, D = diag(d_1,⋯,d_n) is a diagonal matrix with d_i = \sum_{j=1}^{n} w_{ij}, and L = D − W is called the graph Laplacian matrix (Spielman, 2007).
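As an illustration of this construction, the sketch below builds a symmetric k′-nearest-neighbor graph with heat kernel weights and returns the Laplacian L = D − W; the function name and the parameter choices (k_nn for k′, sigma for σ) are ours.

```python
import numpy as np

def heat_kernel_knn_graph(X, k_nn, sigma):
    """Symmetric k'-NN graph with heat kernel weights
    w_ij = exp(-||x_i - x_j||^2 / sigma), and L = D - W.
    X is p x n; each column is one sample (one graph vertex)."""
    G = X.T @ X                                    # n x n Gram matrix
    sq_norms = np.diag(G)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * G
    np.fill_diagonal(sq_dist, 0.0)
    n = X.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq_dist[i])[1:k_nn + 1]  # k' nearest, skip self
        W[i, nbrs] = np.exp(-sq_dist[i, nbrs] / sigma)
    W = np.maximum(W, W.T)                         # symmetrize
    D = np.diag(W.sum(axis=1))
    return W, D - W                                # weights and Laplacian L
```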

Methodology

The PL21GPCA procedure is presented in this section. Figure 1 illustrates our general research framework. In brief, our work includes two steps. The first is obtaining the optimal projected matrix Up×k and the principal directions matrix Vk×n via PL21GPCA. The second is to evaluate the validity of PL21GPCA. In this step, based on the principal directions matrix Vk×n obtained by PL21GPCA, the classic clustering method K-means is employed for tumor sample clustering. According to the projected matrix Up×k, the differentially expressed genes are selected to carry out gene network analysis to find the key genes with close relationships to some cancers.


Figure 1. The general schematic framework of the PL21GPCA methodology.

To summarize, three aspects are highlighted in our method:

(1) To reduce the influence of outliers and noise, the non-convex proximal Lp-norm ∥·∥_p (0 < p < 1) is used on the loss function, which improves the robustness of the optimization model compared with other constraints.

(2) To enhance the sparsity of gene expression in cancer samples, the L2,1-norm is used on the projected matrix U_{p×k}.

(3) To retain the intrinsic geometric structure of the data points (samples), the graph regularization term is introduced into the optimization model.

Assume the input matrix X = (x_1,⋯,x_n) ∈ R^{p×n} (p ≫ n), which denotes the expression of p genes in n samples. Our goal is to find the optimal low-dimensional (k-dim) subspace denoted as V^T = (v_1,⋯,v_n) ∈ R^{k×n} (V^TV = I) and the projected matrix U_{p×k} = (u_1,⋯,u_k) ∈ R^{p×k} in the low-dimensional subspace. The traditional PCA finds U and V using the squared Frobenius norm in the solution. In our optimization model, the proximal Lp-norm ∥·∥_p (0 < p < 1) (Chartrand, 2012; Nie et al., 2013; Feng et al., 2017) replaces the traditional quadratic loss function ∥·∥_F to reduce the influence of outliers and noise. The L2,1-norm is used on one of the regularization terms to enhance the sparse gene expression in cancer samples. The graph Laplacian regularization term, emphasizing the local invariance of the intrinsic geometric structure, is introduced into the optimization model.

The objective function of this method is designed as follows:

$$\arg\min_{U,V}\left\{\|X-UV^{T}\|_{p}+\lambda\|U\|_{2,1}+\alpha Tr(V^{T}LV)\right\}\quad s.t.\quad V^{T}V=I,\;0<p<1,\;\lambda>0,\;\alpha>0 \tag{5}$$

Clearly, the objective function is somewhat intractable because it is non-convex and non-smooth. We adopt the augmented Lagrange multiplier (ALM) method (Hestenes, 1969; Bertsekas, 1982; Spielman, 2007; Lin et al., 2010) to address this optimization problem. Researchers have proven that the ALM algorithm possesses Q-linear convergence properties under some conditions (Bertsekas, 1982).

When using the ALM method to obtain the optimal solution of (5), we replace X − UV^T with E. Eq. (5) can then be equivalently written as:

$$\arg\min_{E,U,V}\left\{\|E\|_{p}+\lambda\|U\|_{2,1}+\alpha Tr(V^{T}LV)\right\}\quad s.t.\quad E-X+UV^{T}=0,\;V^{T}V=I \tag{6}$$

According to the ALM method, eq. (6) is equivalent to minimizing:

$$L_{\mu,Y}(E,U,V)=\|E\|_{p}+\frac{\mu}{2}\left\|E-X+UV^{T}+\frac{Y}{\mu}\right\|_{F}^{2}+\lambda\|U\|_{2,1}+\alpha Tr(V^{T}LV) \tag{7}$$

where Y is the Lagrangian multiplier, and μ is the step size of the update rule. In (7), there are three variables to be solved. The alternating direction method (ADM) (Gabay and Mercier, 1976) is adopted to tackle this thorny problem because the equation with only one variable is easily solved when the others are fixed. By this means, (7) naturally results in three subproblems.

Problem 1: When U and V are fixed, (7) is written as follows:

$$L_{\mu,Y}(E,U,V)=\|E\|_{p}+\frac{\mu}{2}\left\|E-X+UV^{T}+\frac{Y}{\mu}\right\|_{F}^{2} \tag{8}$$

where 0 < p < 1. Eq. (8) can be solved by the proximal shrink operator, denoted as follows:

$$shrink_{p}(t,\delta):=\max\left\{0,\,|t|-\delta|t|^{p-1}\right\}\frac{t}{|t|} \tag{9}$$

Let t = X − UV^T − Y/μ and δ = 1/μ. Then, according to the generalized shrinkage operation (a generalization of soft thresholding) proposed by Chartrand (2012), E is updated as:

$$E_{r+1}=shrink_{p}\left\{X-U_{r}(V_{r})^{T}-\frac{Y_{r}}{\mu_{r}},\;\frac{1}{\mu_{r}}\right\} \tag{10}$$
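A direct element-wise implementation of Eqs. (9) and (10) might look as follows; the convention shrink_p(0, ·) = 0 is implied by Eq. (9) and made explicit here.

```python
import numpy as np

def shrink_p(T, delta, p):
    """Element-wise generalized p-shrinkage of Eq. (9)
    (Chartrand, 2012): max{0, |t| - delta*|t|^(p-1)} * t/|t|,
    with shrink_p(0, .) = 0 by convention."""
    absT = np.abs(T)
    with np.errstate(divide="ignore", invalid="ignore"):
        mag = np.maximum(0.0, absT - delta * absT ** (p - 1.0))
    return np.where(absT > 0.0, mag, 0.0) * np.sign(T)

# E update of Eq. (10):
# E = shrink_p(X - U @ V.T - Y / mu, 1.0 / mu, p)
```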

Problem 2: When E and V are fixed, (7) is simplified as follows:

$$L_{\mu,Y}(E,U,V)=\frac{\mu}{2}\left\|E-X+UV^{T}+\frac{Y}{\mu}\right\|_{F}^{2}+\lambda\|U\|_{2,1} \tag{11}$$

To simplify (11), let H = X − E − Y/μ. Then, (11) is written as:

$$L_{\mu,Y}(E,U,V)=\frac{\mu}{2}\|UV^{T}-H\|_{F}^{2}+\lambda\|U\|_{2,1} \tag{12}$$

The partial derivative of L with respect to U is:

$$\frac{\partial L}{\partial U}=\mu(UV^{T}-H)V+2\lambda QU \tag{13}$$

where Q ∈ R^{p×p} is a diagonal matrix with q_{i,i} = \frac{1}{2\|U(i,:)\|_{2}} (i = 1,⋯,p) (Xiang et al., 2012). Setting (13) equal to 0, the following update rule for U is obtained:

$$U_{r+1}=\left(I+\frac{2\lambda}{\mu_{r}}Q_{r}\right)^{-1}H_{r}V_{r} \tag{14}$$

To simplify (14), let A_r = \left(I+\frac{2\lambda}{\mu_{r}}Q_{r}\right)^{-1}, and then (14) is written as:

$$U_{r+1}=A_{r}H_{r}V_{r} \tag{15}$$

Problem 3: When E and U are fixed, (7) is simplified as follows:

$$L_{\mu,Y}(E,U,V)=\frac{\mu}{2}\left\|E-X+UV^{T}+\frac{Y}{\mu}\right\|_{F}^{2}+\alpha Tr(V^{T}LV) \tag{16}$$

With H = X − E − Y/μ as before, (16) can be written equivalently as:

$$L_{\mu}(E,U,V)=\frac{\mu}{2}\|UV^{T}-H\|_{F}^{2}+\alpha Tr(V^{T}LV)=\frac{\mu}{2}Tr\left((UV^{T}-H)(UV^{T}-H)^{T}\right)+\alpha Tr(V^{T}LV) \tag{17}$$

Based on (17), V is found by minimizing:

$$V=\arg\min_{V}Tr\left(V^{T}\left(\frac{\alpha}{\mu}L-H^{T}AH\right)V\right) \tag{18}$$

Therefore, V_{r+1} can be obtained as follows:

$$V_{r+1}=(v_{1},\cdots,v_{k}) \tag{19}$$

where (v_1,…,v_k) are the k eigenvectors corresponding to the smallest k eigenvalues of the matrix \frac{\alpha}{\mu}L-H^{T}AH. Thus, based on the ALM, the ADM and the shrinkage operation, the solution to the optimization model described in (5) is given in Algorithm 1. In the optimization model, there are six parameters, k, p, λ, α, ρ, and μ, to be pre-determined. Among them, μ and ρ control the step size in the update rule of the ALM, and we set μ = 10^{−2} and ρ = 1.2 for all gene expression dataset experiments (Feng et al., 2016). The parameter k is determined by referring to the number of prior categories of each dataset. The three essential parameters p, λ, and α in (5) are chosen for each situation to achieve the best clustering performance through extensive experiments, so different parameters are chosen for different datasets. Detailed parameter selections and discussions are described in section "Experiments and Discussion."


Algorithm 1. The solution to the optimization model in (5).
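For readers who want to reproduce the procedure, the following is a minimal Python sketch of Algorithm 1, combining the three subproblem updates with the ALM multiplier and step-size updates. The truncated-SVD initialization, iteration cap, and stopping tolerance are our assumptions, as the paper does not specify them; shrink_p is the operator sketched after Eq. (10).

```python
import numpy as np

def pl21gpca(X, L, k, p, lam, alpha, mu=1e-2, rho=1.2,
             max_iter=200, tol=1e-6):
    """Sketch of Algorithm 1 (ALM + ADM) for Eq. (5).
    X: p x n data matrix, L: n x n graph Laplacian, k: subspace dim.
    mu = 1e-2 and rho = 1.2 follow the paper; max_iter and tol are
    illustrative assumptions."""
    # Initialize U, V from a truncated SVD (an assumption; the paper
    # does not specify the initialization).
    Uf, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, V = Uf[:, :k] * s[:k], Vt[:k, :].T
    E, Y = np.zeros_like(X), np.zeros_like(X)
    for _ in range(max_iter):
        # E update, Eq. (10): generalized p-shrinkage.
        E = shrink_p(X - U @ V.T - Y / mu, 1.0 / mu, p)
        H = X - E - Y / mu
        # U update, Eqs. (13)-(15). Q (hence A) is diagonal, so the
        # inverse reduces to an element-wise division.
        q = 1.0 / (2.0 * np.maximum(np.linalg.norm(U, axis=1), 1e-12))
        a = 1.0 / (1.0 + (2.0 * lam / mu) * q)   # diagonal of A
        U = a[:, None] * (H @ V)                 # Eq. (15): U = A H V
        # V update, Eqs. (18)-(19): k smallest eigenvectors.
        M = (alpha / mu) * L - (H.T * a) @ H     # H^T A H with diagonal A
        M = (M + M.T) / 2.0                      # enforce symmetry
        V = np.linalg.eigh(M)[1][:, :k]          # eigh sorts ascending
        # ALM multiplier and step-size updates.
        R = E - X + U @ V.T
        Y = Y + mu * R
        mu = rho * mu
        if np.linalg.norm(R, "fro") <= tol * np.linalg.norm(X, "fro"):
            break
    return U, V
```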

Experiments and Discussion

Gene Expression Datasets

Five gene expression datasets, which include one benchmark dataset, two single-cancer datasets from TCGA, and two integrated multicancer datasets from TCGA, are used to evaluate the performance of PL21GPCA. The validation experiments consist of two aspects: "tumor sample clustering" and "gene network module discovery." Based on the optimal low-dimensional (k-dim) subspace denoted as V^T = (v_1,⋯,v_n) ∈ R^{k×n} (V^TV = I), the classical clustering method K-means is used for tumor clustering. For comparison, extensive experiments are also performed using existing dimensionality reduction methods, including SPCA (Journee et al., 2010), RPCA (Candès et al., 2009), gLPCA (Jiang et al., 2013), pgLPCA (Feng et al., 2017) and GNMF (Cai et al., 2011). Among the compared methods, some are based on PCA, and some introduce the graph Laplacian regularization term. Based on the optimal projected matrix U_{p×k}, the differentially expressed genes are selected for gene network analysis to find key genes with close relationships to some cancers.

The details of the five datasets are as follows. The benchmark gene expression dataset is the lung cancer data (Bhattacharjee et al., 2001), which has often been employed by researchers to evaluate their algorithms (Lee et al., 2010; Kong et al., 2017) and consists of 12,625 genes measured in 56 samples. There are four types of lung cancer among the 56 samples: pulmonary carcinoid (20), colon metastases (13), small cell lung carcinoma (6) and normal lung samples (17). The two single-cancer datasets and the two integrated multicancer datasets are all from The Cancer Genome Atlas (TCGA), which is the largest database of tumor specimens. The genomic data provided by TCGA include DNA methylation, microRNA expression, gene expression, protein expression, and DNA copy number, among others. We downloaded gene expression datasets (at level 3) of five different cancers from TCGA: colorectal cancer (CRC), cholangiocarcinoma (CHOL), squamous cell carcinoma of the head and neck (HNSC), pancreatic cancer (PAAD), and esophageal cancer (ESCA). Each dataset consists of 20,502 genes expressed in different numbers of samples. In our experiments, CRC and CHOL are used as single-cancer datasets to evaluate the performance of the PL21GPCA method. There are 281 samples for CRC and 45 for CHOL. Each of these two datasets contains two types of samples, "negative" and "positive." "Negative" or "NT" denotes normal samples; "positive" or "TP" denotes diseased samples. There are 262 "TP" samples in the CRC data and 36 in the CHOL data, and the rest are "NT" samples. Two integrated datasets are used to further verify the performance of the PL21GPCA method, each consisting of three types of cancers. One of the integrated datasets, H_C_P, contains 836 "TP" samples, among which the sample numbers of the three cancers are 398 (HNSC), 262 (CRC), and 176 (PAAD). The other integrated dataset, E_C_C, contains 481 "TP" samples, in which the sample numbers of the three cancers are 183 (ESCA), 36 (CHOL), and 262 (CRC). The statistics of these datasets are summarized in Table 1.


Table 1. Statistical information on the experimental data.

Tumor Sample Clustering

Evaluation Metric

Based on the optimal principal directions V^T = (v_1,⋯,v_n) ∈ R^{k×n} (V^TV = I), the K-means algorithm is employed for tumor sample clustering. The accuracy (ACC) and the normalized mutual information (NMI) are the two most commonly used metrics to evaluate clustering results (Cai et al., 2005). For the ith sample, we use p_i to denote the prior label and r_i to denote the obtained clustering label. The metric ACC is defined as follows:

$$ACC=\frac{\sum_{i=1}^{n}\theta(p_{i},map(r_{i}))}{n} \tag{20}$$

where n denotes the total number of samples in each dataset. The function θ(x,y) equals 1 if x = y and 0 otherwise. The function map(r_i) maps each obtained cluster label r_i to the equivalent prior label. Let C be the prior set of clusters and C′ be the set obtained from our algorithm. Their mutual information metric MI(C,C′) is defined as:

$$MI(C,C^{\prime})=\sum_{c_{i}\in C,\,c_{j}^{\prime}\in C^{\prime}}p(c_{i},c_{j}^{\prime})\log_{2}\frac{p(c_{i},c_{j}^{\prime})}{p(c_{i})p(c_{j}^{\prime})} \tag{21}$$

where p(c_i) and p(c′_j) are the probabilities that a sample arbitrarily selected from the dataset belongs to clusters c_i and c′_j, respectively, and p(c_i,c′_j) is the joint probability. In the experiments, the metric NMI is defined as follows:

$$NMI(C,C^{\prime})=\frac{MI(C,C^{\prime})}{\max(H(C),H(C^{\prime}))} \tag{22}$$

where H(C) and H(C′) are the entropies of C and C′, respectively. Therefore, the metric NMI(C,C′) ranges from 0 to 1: NMI = 1 if the two sets of clusters are identical, and NMI = 0 if the two sets are independent.

However, one problem needs to be resolved: the K-means algorithm may or may not converge to the same solution in each run with random initial conditions, so the ACC and NMI values obtained from a single run of K-means are not sufficient to characterize the result. To address this, for the given cluster number k, K-means was run 50 times on each dataset, and the average performance was computed. As a reference, we also recorded the maximum values of ACC and NMI over the 50 runs. Thus, four metrics, ACC_max, ACC_mean, NMI_max and NMI_mean, are used to evaluate our experiments. Generally, the larger the mean value is, the better the clustering performance and the better the stability and robustness of the clustering, which also indicates that the corresponding dimension reduction method has good robustness and a good sparsity effect.
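A sketch of this evaluation protocol is given below, assuming integer-coded labels and realizing the map(·) function of Eq. (20) with the Hungarian algorithm; the use of SciPy and scikit-learn here is our choice, not the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(prior, pred):
    """ACC of Eq. (20); map() is realized with the Hungarian algorithm
    on the label confusion matrix. Labels are assumed 0..k-1."""
    k = max(prior.max(), pred.max()) + 1
    cost = np.zeros((k, k))
    for r, t in zip(pred, prior):
        cost[r, t] -= 1                     # negate counts to maximize matches
    rows, cols = linear_sum_assignment(cost)
    mapping = dict(zip(rows, cols))
    return np.mean([mapping[r] == t for r, t in zip(pred, prior)])

def evaluate_clustering(V, prior, k, runs=50):
    """Run K-means `runs` times on the rows of V and report
    ACC_max, ACC_mean, NMI_max, NMI_mean."""
    accs, nmis = [], []
    for seed in range(runs):
        pred = KMeans(n_clusters=k, n_init=1,
                      random_state=seed).fit_predict(V)
        accs.append(clustering_acc(prior, pred))
        # average_method="max" matches the normalization of Eq. (22)
        nmis.append(normalized_mutual_info_score(prior, pred,
                                                 average_method="max"))
    return max(accs), np.mean(accs), max(nmis), np.mean(nmis)
```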

Parameter Selection

The PL21GPCA model has three essential parameters, p, λ, and α, which need to be determined in (5). The range of each parameter is 0 < p < 1, λ > 0, α > 0. When determining the optimal value of one parameter, the other two parameters are fixed. We focus on the influence of the value of p on the performance. PL21GPCA achieves consistently good performance when the two regularization parameters λ and α are varied from 10 to 1,000 on all three datasets. Figure 2 shows how the average performance varies when the essential parameter p takes nine different values from 0.1 to 0.9. For every dataset, extensive experiments are carried out to seek the appropriate parameters that achieve the best performance for tumor sample clustering. Thus, different parameters are chosen for different datasets (see Table 2).


Figure 2. The average performance when the essential parameter p takes nine different values from 0.1 to 0.9. (A) The mean value of ACC for different cancer datasets. (B) The mean value of NMI for different cancer datasets.


Table 2. Values of the three parameters p, λ, and α for different datasets.

There is another parameter that does not appear in the objective function of PL21GPCA but is nevertheless important to the performance of our method: the parameter k′, the number of nearest neighbors of every point used when constructing the graph in the graph Laplacian embedding step. Setting this parameter too small may cause overfitting, while setting it too large may increase the error. Through extensive experiments, we find that an appropriate value for this parameter is near the square root of the sample number for the different datasets.

Clustering Results

Tables 3–5 show the clustering results on the lung cancer data, the single-cancer data from TCGA (CRC and CHOL datasets), and the integrated cancer data (H_C_P and E_C_C datasets), comparing the PL21GPCA-based method with its competitors. For each dataset with a given cluster number k, the K-means algorithm was run 50 times to randomize the experiments. The maximum and mean value metrics are all presented in the tables, and the performance of the PL21GPCA-based method is highlighted in bold. Regardless of the dataset, the PL21GPCA-based method always achieves the best performance on the mean value metrics ACC_mean and NMI_mean. As mentioned above, the mean value is more meaningful than the maximum value, which is for reference only. By leveraging the power of three measures, namely, taking the proximal Lp-norm ∥·∥_p (0 < p < 1) on the loss function, employing the L2,1-norm regularization term to ensure feature selection, and introducing the Laplacian regularization term to emphasize the geometric structure of the data, the PL21GPCA-based method consistently achieves better clustering performance.


Table 3. Clustering performance on lung cancer.


Table 4. Clustering performance on CRC and CHOL.


Table 5. Clustering performance on H_C_P and E_C_C.

For the different types of data used in the experiments, a number of meaningful points need to be emphasized further.

The benchmark data

For the lung cancer dataset, Table 3 shows that the PL21GPCA-based method achieves the same performance as SPCA, RPCA and pgLPCA on the maximum value metrics (the ACC_max and the NMI_max are also 100%) but is clearly superior to the other methods in terms of the mean value metrics (ACC_mean reaches 96.82% and NMI_mean reaches 93.44%).

Single-cancer data from TCGA

Table 4 shows the clustering performance on the two single-cancer datasets from TCGA. For the CRC dataset, our method presents markedly superior performance compared with the other methods, with the ACC_mean reaching 99.64%, equal to the ACC_max. The good average performance shows the robustness of the PL21GPCA method. In addition, the two NMI metrics (both reaching 90.55%) also go far beyond the performance of the other methods. For the CHOL dataset, all the methods achieve the same results (100%) on the maximum value metrics. Our method achieves the same performance (100%) as GNMF and RPCA in terms of the mean value metrics. It has been surmised that the two kinds of samples may already be distinctly separable in the original CHOL data (Kong et al., 2017).

Integrated multicancer data from TCGA

Table 5 reports the results on the two integrated datasets. It shows that the PL21GPCA method performs much better than its competitors. As highlighted in bold in Table 5, for the H_C_P data, the ACC_max and the ACC_mean both reach 96.41%, and the NMI_max and the NMI_mean are also superior to the corresponding values for the other methods. For the E_C_C data, our method is still outstanding; taking the ACC metric as an example, the ACC_max reaches 85.65%, and the ACC_mean reaches 84.09%. Based on the excellent performance on these two integrated datasets, we may speculate that the PL21GPCA method is more suitable for learning the compact representation of higher-dimensional and more complex data than its competitors, although this needs further verification.

Finally, as we can see from Tables 3–5, among the compared methods, the RPCA method performs second only to our method and better than the other competitors, such as SPCA, GNMF, gLPCA, and pgLPCA. The performance of RPCA is shown in italics in the tables. If the intrinsic geometric structure were introduced into RPCA, would the performance improve further? This question is also worth further investigation.

Embedding Evaluation

To further show the performance of the proposed dimensionality reduction method compared with others, we visualize the data distributions of the low-dimensional embeddings corresponding to the first two components of each PCA-based method. Besides the proposed method PL21GPCA, the results of three other methods, SPCA, gLPCA, and pgLPCA, are compared because these methods are also direct extensions of PCA. Figure 3 presents the sample clustering results in a two-dimensional space. We choose two representative datasets, the CRC data and the H_C_P data, to show the results. Figures 3A–D are the results of SPCA, gLPCA, pgLPCA, and PL21GPCA, respectively, on the CRC dataset. Figures 3E–H are the corresponding results of the four methods on the H_C_P dataset. Whether on the CRC data, which contains two types of cancer samples, or on the H_C_P data, which contains three types of cancer samples, SPCA and gLPCA mix samples from different categories together, and pgLPCA can only separate the samples into rough categories, so their clustering results are unsatisfactory. In contrast, PL21GPCA produces embeddings with a clearer distribution, and its clustering results are therefore better than those of the compared methods. The visualized results verify the robustness and flexibility of the proposed model.


Figure 3. A visualized comparison of the low-dimensional embeddings produced by SPCA, gLPCA, pgLPCA, and PL21GPCA on the CRC and H_C_P datasets. (A–D) The results of SPCA, gLPCA, pgLPCA, and PL21GPCA, respectively, on the CRC dataset. (E–H) The corresponding results of the four methods on the H_C_P dataset.

Experiments on Simulated Data

Experiments on simulated data are also carried out to evaluate the effectiveness of PL21GPCA. The simulated data used in the experiment is a matrix X_{3000×80} generated by the rand function in MATLAB. To simulate the representation of features in different types of samples, some changes are made to the generated matrix X_{3000×80}. First, we add 1 to the values of columns 1 to 20 in rows 30i − 29 (i = 1,⋯,100), add 2 to the values of columns 21 to 40 in rows 30i − 19 (i = 1,⋯,100), add 3 to the values of columns 41 to 60 in rows 30i − 9 (i = 1,⋯,100), add 4 to the values of columns 61 to 80 in rows 30i − 5 (i = 1,⋯,100), add 2 to the values of columns 21 to 40 in rows 30i − 25 (i = 1,⋯,100), and add 1 to the values of columns 1 to 20 in rows 30i − 15 (i = 1,⋯,100), so that the 80 samples in the simulated data contain four categories. Second, we use the imnoise function in MATLAB to add Gaussian white noise of different magnitudes to X; the mean of the added Gaussian white noise is 0, and the variance σ² is chosen in the range [0.4, 1.2]. Next, we use the proposed method PL21GPCA and the compared methods to reduce the dimension of and denoise the simulated data, and we then use the K-means method to cluster the denoised data; the evaluation metric ACC_mean described above is used to test the effectiveness of each method. The K-means algorithm is run 50 times to randomize the experiments.
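A NumPy analogue of this simulation protocol might look as follows, reading the row indices as the 1-based values 30i − offset; note that MATLAB's imnoise rescales images to [0, 1] internally, a step this additive-noise sketch skips for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((3000, 80))               # analogue of MATLAB's rand(3000, 80)

def planted_rows(offset):
    """0-based indices of the 1-based rows 30*i - offset, i = 1..100."""
    return np.arange(1, 101) * 30 - offset - 1

# Plant the four sample categories (column blocks 1-20, 21-40,
# 41-60, 61-80), following the protocol described above.
X[planted_rows(29), 0:20] += 1
X[planted_rows(19), 20:40] += 2
X[planted_rows(9), 40:60] += 3
X[planted_rows(5), 60:80] += 4
X[planted_rows(25), 20:40] += 2
X[planted_rows(15), 0:20] += 1

# Additive zero-mean Gaussian white noise:
sigma2 = 0.8                             # variance chosen from [0.4, 1.2]
X_noisy = X + rng.normal(0.0, np.sqrt(sigma2), size=X.shape)
```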

Table 6 shows the experimental results on the simulated data. It can be seen that the performance of all methods changes as the noise increases. The best performance of the different methods under each noise level is marked in bold. Although the performance of PL21GPCA is second only to RPCA when the noise is low (σ² = 0.4), as the Gaussian white noise increases, our proposed method is mostly ahead of the other methods, especially when σ² = 0.6, 0.8, and 1.2, which shows that the new method has better denoising ability and robustness.


Table 6. Clustering performance on simulated data with different Gaussian white noise.

Gene Network Module Discovery

Due to the outstanding performance of our method on the CRC dataset and the integrated H_C_P dataset, the construction and analysis of the gene network are based on these two datasets. The strategy of gene network module discovery involves two steps. First, the genes for constructing the co-expression gene networks are selected. Second, based on the filtered genes, co-expression networks are established, and then the key genes that may be closely related to some cancers are analyzed.

Gene Selection

In this step, there are two problems to be solved: one is how to select genes, and the other is how many to select. It is known that among thousands of genes, only a handful regulate a specific biological process (Delbert et al., 2005; Liu et al., 2013). This minority of genes is called the differentially expressed genes (Liu J. et al., 2015). In this article, the differentially expressed genes used for gene network analysis are selected according to the projected matrix U_{p×k}. We denote the optimal projected matrix U_{p×k} as Ũ; these differentially expressed genes can then be identified according to Ũ (Liu J. et al., 2015; Feng et al., 2016). We write Ũ as follows:

$$\tilde{U}=\begin{bmatrix}\tilde{u}_{11}&\tilde{u}_{12}&\cdots&\tilde{u}_{1k}\\ \tilde{u}_{21}&\tilde{u}_{22}&\cdots&\tilde{u}_{2k}\\ \vdots&\vdots&\ddots&\vdots\\ \tilde{u}_{p1}&\tilde{u}_{p2}&\cdots&\tilde{u}_{pk}\end{bmatrix} \tag{23}$$

The upregulated genes are reflected by positive values in the matrix Ũ, and the downregulated genes are reflected by negative values (Liu et al., 2013). Therefore, the absolute values of the entries in Ũ are used to identify the differentially expressed genes. The absolute values of the entries in each row of Ũ are summed, and the evaluating vector denoted as Û is obtained:

$$\hat{U}=\left[\sum_{j=1}^{k}|\tilde{u}_{1j}|\;\;\sum_{j=1}^{k}|\tilde{u}_{2j}|\;\;\cdots\;\;\sum_{j=1}^{k}|\tilde{u}_{pj}|\right]^{T} \tag{24}$$

A larger entry in Û indicates a more strongly differentially expressed gene. Therefore, we sort the elements in Û in descending order and take the top l (l ≪ p) elements. In many studies, it has been unclear how many genes should be selected for gene network analysis. Since only a small number of genes regulate a specific biological process, these genes may play a decisive role in the clustering results of tumor samples. In this paper, the number of genes used for constructing the gene network is determined according to the clustering performance based on the selected genes. By experimentally investigating the clustering performance with the number of selected genes varied from 500 to 2,000, we found that the clustering results corresponding to 1,600 genes are best for the CRC data and 700 for the H_C_P data.
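A sketch of this ranking step follows, computing the evaluating vector of Eq. (24) from the projected matrix and returning the indices of the top-l genes; the function name is ours.

```python
import numpy as np

def select_genes(U, l):
    """Rank genes by the evaluating vector of Eq. (24): row-wise sums
    of absolute loadings of the projected matrix U (p x k), then keep
    the indices of the top-l genes."""
    u_hat = np.abs(U).sum(axis=1)       # entries of U-hat
    return np.argsort(u_hat)[::-1][:l]  # descending order, top l

# e.g., l = 1600 for the CRC data and l = 700 for the H_C_P data.
```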

Construction of Gene Networks

Suppose l differentially expressed genes are used to construct the gene network. Let the matrix R_{l×n} denote the expression of the l genes in n samples. We use the Pearson correlation coefficient (PCC) (Hou et al., 2019) to measure the correlation of any two genes in R_{l×n}. The absolute values in the PCC matrix vary in the range [0,1]; the larger the value is, the higher the correlation. Based on the matrix R_{l×n}, an adjacency matrix A_{l×l} can be calculated, and from the adjacency matrix, an intuitive visualized graph of the gene interaction network composed of several modules is obtained.
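A minimal sketch of this network construction follows; since the paper does not state the correlation cutoff used to build the adjacency matrix, the threshold below is an illustrative assumption.

```python
import numpy as np

def coexpression_adjacency(R, cutoff=0.6):
    """Adjacency matrix of the co-expression network from R (l x n):
    absolute Pearson correlation between gene rows, thresholded.
    The cutoff value is an assumption for illustration only."""
    P = np.abs(np.corrcoef(R))      # |PCC|, values in [0, 1]
    np.fill_diagonal(P, 0.0)        # drop self-correlations
    return (P >= cutoff).astype(int)  # adjacency matrix A (l x l)
```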

Analysis of Gene Network Modules

There are 39 modules, including 218 nodes and 504 edges, in the constructed network based on the CRC data. We analyzed the top 10 nodes (genes) with the highest degrees in the first three modules that retained more relevant interactions. The degree of a node (gene) shows its role in the network modules: the larger the degree of a node (gene) is, the more important it is, and such nodes (genes) may maintain the tight connectivity of the network. Figure 4 shows the main part of the first three gene network modules, in which a small number of nodes with very low degree have been removed. The roles of the top ten genes in the first three modules are illustrated in Figure 4, where the degree of a node is represented by its size and color: the larger the node and the darker its color, the larger its degree. Referring to GeneCards (http://www.genecards.org/), we list the annotations of the top ten genes in Table 7. Five of the top ten genes have been validated as associated with multiple cancers: SPARC, ABCC12, COL6A3, LUM, and RPS3. The corresponding nodes of these genes are marked with a black outline in Figure 4 and are also shown in bold in Table 7. In the literature (Liu Q. Z. et al., 2015), the gene SPARC has been recommended as a predictor of colorectal cancer. The gene ABCC12 is a human ATP binding cassette (ABC) transporter and a multidrug resistance protein (MRP9), and MRP9 has been recognized as an important target for the immunotherapy of breast cancer (Bera et al., 2002). Studies have shown that colorectal cancer can be predicted by the gene COL6A3 because it is overexpressed in samples of colorectal cancer; therefore, COL6A3 is considered a potential diagnostic and prognostic marker gene for colorectal cancer (Qiao et al., 2015). As a member of the leucine-rich proteoglycan family, the gene Lumican (LUM) is overexpressed in many kinds of cancers, including colorectal, neuroendocrine, cervical, carcinoid, breast, and pancreatic cancer, and it also promotes the growth and invasion of pancreatic cancer (Ishiwata et al., 2007). The ribosomal protein gene S3 (RPS3) is also overexpressed in colorectal cancer; researchers have found an increase in ribosome synthesis in patients with colorectal cancer (Pogue-Geile et al., 1991). Although the other five genes, RPL32, TMEM59L, LOC642929, LHX2, and TLCD3B, have not been identified in clinical studies indicating their effect on cancers, they may be considered candidate oncogenes because of their high ranking in our constructed gene network modules. By constructing co-expression gene network modules based on the CRC dataset, we found some disease-causing genes for colorectal cancer and other related cancers. This shows that constructing gene network modules from the genes filtered by PL21GPCA can help discover key oncogenes.


Figure 4. The first three modules of the constructed network based on the CRC data. The five marked genes SPARC, ABCC12, COL6A3, LUM, and RPS3 have been confirmed to be associated with CRC and other cancers. (A) Module 1; (B) Module 2; (C) Module 3.


Table 7. Annotations of the top ten genes in the first three network modules based on CRC data.

The constructed network based on the integrated H_C_P data includes 157 nodes and 644 edges. We analyzed the five important nodes (genes) with the highest degrees in the first three modules that retained more relevant interactions. Figure 5 illustrates the main part of the first three gene network modules, in which the nodes of very low degree have also been removed. Referring to GeneCards, their annotations are listed in Table 8. The five genes RPL32, EEF1G, SPRR1B, COL1A2, and MMP2 have been recognized to be related to multiple cancers, and the corresponding nodes are marked with a black outline in Figure 5. Wan et al. (2004) conducted large-scale experiments on human liver cancer cells and showed that RPL32 is one of the potential genes that affect human cell growth and cancer formation, providing an important tool for diagnostic markers and drug targets (Wan et al., 2004). EEF1G has been thought to be a characteristic gene for colorectal cancer; it is highly expressed in most colorectal cancers and could be considered a marker gene for colorectal cancer detection (Matassa et al., 2013). In addition, the expression level of EEF1G in pancreatic tumor cells was higher than that in normal cells (Lew et al., 1992). SPRR1B is overexpressed in human oral squamous cells, and it has been experimentally shown that SPRR1B affects cell growth and maintenance through MAP kinase signaling (Michifuri et al., 2013). Misawa et al. (2011) studied the expression of COL1A2 in head and neck squamous cell carcinoma (HNSC) and found that hypermethylation of CpG may cause inactivation of the gene COL1A2; therefore, the COL1A2 gene may affect the formation and development of HNSC and could become a major biomarker (Misawa et al., 2011). As a member of the matrix metalloproteinase (MMP) gene family, MMP2 is relevant to the generation of malignant tumors, including colorectal cancer, lung cancer, and breast cancer (Yu et al., 2002; Arajo et al., 2015; Ren et al., 2015). Analysis of the gene network constructed from integrated multicancer data is helpful for mining the interrelationships between different cancers and genes and may provide an important reference for the diagnosis and treatment of various diseases.


Figure 5. The first three modules of the constructed network based on the H_C_P data. The five marked genes RPL32, EEF1G, SPRR1B, COL1A2, and MMP2 have been confirmed to be associated with multiple cancers. (A) Module 1; (B) Module 2; (C) Module 3.


Table 8. Annotations of the most important five genes in the first three network modules based on H_C_P data.

Conclusion and Suggestions

In this article, we propose a new dimensionality reduction method named PL21GPCA based on PCA for robust tumor sample clustering and gene network module discovery. Building on traditional PCA, the non-convex proximal Lp-norm ∥·∥_p (0 < p < 1) is applied on the loss function to decrease the sensitivity to outliers and noise. The L2,1-norm is used on the projected matrix to enhance the sparse gene expression in cancer samples. The graph regularization term is introduced into the optimization model to retain the geometric structure of the data. Five gene expression datasets, including one benchmark dataset, two higher-dimensional single-cancer datasets from TCGA, and two integrated multicancer datasets from TCGA, are used to evaluate the performance of our method. The comparative experiments demonstrate that the PL21GPCA method outperforms many existing methods in terms of tumor sample clustering. Moreover, this method is employed to discover gene network modules and thus find the key genes with close relationships to cancers. The results of our study may be a useful reference for clinical diagnosis.

There are several suggestions for future research. First, in the optimization model of PL21GPCA, the constraint used on the loss function is the non-convex proximal Lp-norm ∥·∥_p (0 < p < 1), since Lp-norm minimization can result in a sparser solution than the L1-norm and performs better in terms of robustness to outliers than the L2-norm. However, in addition to the generalized shrinkage operation proposed by Chartrand (2012), there are other approaches to the Lp-norm (0 < p < 1) minimization problem (Guo et al., 2013; Qin et al., 2013). Therefore, we will continue to explore other solutions to the optimization model with the Lp-norm ∥·∥_p (0 < p < 1). Second, we will evaluate the performance of PL21GPCA as a compact representation method combined with other methods, including supervised and unsupervised clustering methods such as spectral clustering, the support vector machine (SVM) or their improved versions. Third, as mentioned above, the PL21GPCA method achieves especially outstanding performance when processing integrated data, so we will apply it to many other integrated datasets to verify its performance further.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: the lung cancer data at http://www.unc.edu/~haipeng and the TCGA data at http://www.tcga.org/.

Author Contributions

X-ZK conceived and designed the experiments. YS and X-ZK performed the experiments and contributed to the writing of the manuscript. S-SY, JW, and L-YD analyzed the data. J-XL and C-HZ contributed to reagents, materials, and analysis tools. All authors contributed to the article and approved the submitted version.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

Many thanks to my co-tutor Yong Xu, who is now a professor at Harbin Institute of Technology, Shenzhen, China.

References

Arajo, R. F., Lira, G. A., Vilaa, J. A., Guedes, H. G., Leito, M. C. A., Lucena, H. F., et al. (2015). Prognostic and diagnostic implications of MMP-2, MMP-9, and VEGF-a expressions in colorectal cancer. Pathol. Res. Practice 211, 71–77. doi: 10.1016/j.prp.2014.09.007

Belkin, M., and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv. Neural Inf. Process. Syst. 14, 585–591.

Belkin, M., and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15, 1373–1396. doi: 10.1162/089976603321780317

Bera, T. K., Iavarone, C., Kumar, V., Lee, S., Lee, B., and Pastan, I. (2002). MRP9, an unusual truncated member of the ABC transporter superfamily, is highly expressed in breast cancer. Proc. Natl. Acad. Sci. U.S.A. 99, 6997–7002. doi: 10.1073/pnas.102187299

Bertsekas, D. P. (1982). Constrained Optimization and Lagrange Multiplier Methods. Cambridge, MA: Academic Press.

Bhattacharjee, A., Richards, W. G., Staunton, J., Li, C., Monti, S., Vasa, P., et al. (2001). Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. U.S.A. 98, 13790–13795. doi: 10.1073/pnas.191502998

Bunte, K., Leppaaho, E., Saarinen, I., and Kaski, S. (2016). Sparse group factor analysis for biclustering of multiple data sources. Bioinformatics 32, 2457–2463. doi: 10.1093/bioinformatics/btw207

Cai, D., He, X. F., and Han, J. W. (2005). Document clustering using locality preserving indexing. IEEE Trans. Knowl. Data Eng. 17, 1624–1637. doi: 10.1109/tkde.2005.198

Cai, D., He, X. F., Han, J. W., and Huang, T. S. (2011). Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1548–1560. doi: 10.1109/tpami.2010.231

Candès, E. J., Li, X., Ma, Y., and Wright, J. (2009). Robust principal component analysis? J. ACM 58, 1–37.

Chartrand, R. (2012). Nonconvex splitting for regularized low-rank + sparse decomposition. IEEE Trans. Signal Process. 60, 5810–5819. doi: 10.1109/tsp.2012.2208955

Chen, X. J., Huang, J. Z., Wu, Q. Y., and Yang, M. (2019). Subspace weighting co-clustering of gene expression data. IEEE-ACM Trans. Comput. Biol. Bioinform. 16, 352–364. doi: 10.1109/tcbb.2017.2705686

Collins, M. (2002). “A generalization of principal component analysis to the exponential family,” in Proceedings of the 14th International Conference on Advances in Neural Information Processing Systems, Cambridge, MA.

Delbert, D., Morris, Q. D., and Frey, B. J. (2005). Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics 21(Suppl. 1) i144–i151.

Ding, C., and He, X. F. (2004). "K-means clustering via principal component analysis," in Proceedings of the 21st International Conference on Machine Learning (ICML), New York, NY, 1, 29.

Feng, C., Gao, Y. L., Liu, J. X., Zheng, C. H., and Yu, J. (2017). PCA based on graph laplacian regularization and P-norm for gene selection and clustering. IEEE Trans. Nanobiosci. 16, 257–265. doi: 10.1109/tnb.2017.2690365

Feng, C. M., Liu, J. X., Gao, Y. L., Wang, J., Wang, D. Q., and Du, Y. (2016). “A graph-laplacian pca based on L1/2-norm constraint for characteristic gene selection,” in Proceedings of the 2016th IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2016), Shenzhen, 1258–1263.

Feng, C. M., Xu, Y., Liu, J. X., Gao, Y. L., and Zheng, C. H. (2019). Supervised discriminative sparse PCA for corn-characteristic gene selection and tumor classification on multiview biological data. IEEE Trans. Neural Netw. Learn. Syst. 30, 2926–2937. doi: 10.1109/tnnls.2019.2893190

Gabay, D., and Mercier, B. (1976). A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2, 17–40. doi: 10.1016/0898-1221(76)90003-1

Goodwin, S., McPherson, J. D., and McCombie, W. R. (2016). Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351. doi: 10.1038/nrg.2016.49

Guo, S., Wang, Z., and Ruan, Q. (2013). Enhancing sparsity via lp (0 < p < 1) minimization for robust face recognition. Neurocomputing 99, 592–602. doi: 10.1016/j.neucom.2012.05.028

He, X., Yan, S., Hu, Y., Niyogi, P., and Zhang, H. J. (2005). Face recognition using laplacian faces. IEEE Trans. Pattern Anal. Mach. Intell. 27, 328–340.

Hestenes, M. R. (1969). Multiplier and gradient methods. J. Optim. Theory Appl. 4, 303–320. doi: 10.1007/bf00927673

Hochreiter, S., Bodenhofer, U., Heusel, M., Mayr, A., Mitterecker, A., Kasim, A., et al. (2010). FABIA: factor analysis for bicluster acquisition. Bioinformatics 26, 1520–1527. doi: 10.1093/bioinformatics/btq227

Hou, M. X., Gao, Y. L., Liu, J. X., Dai, L. Y., Kong, X. Z., and Shang, J. H. (2019). Network analysis based on low-rank method for mining information on integrated data of multi-cancers. Comput. Biol. Chem. 78, 468–473. doi: 10.1016/j.compbiolchem.2018.11.027

Ishiwata, T., Cho, K., Kawahara, K., Yamamoto, T., Fujiwara, Y., Uchida, E., et al. (2007). Role of lumican in cancer cells and adjacent stromal tissues in human pancreatic cancer. Oncol. Rep. 18, 537–543.

Jiang, B., Ding, C., Luo, B., and Tang, J. (2013). “Graph-laplacian PCA: closed-form solution and robustness,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland.

Jolliffe, I. (2002). Principal Component Analysis. New York, NY: Springer.

Journee, M., Nesterov, Y., Richtarik, P., and Sepulchre, R. (2010). Generalized Power method for sparse principal component analysis. J. Mach. Learn. Res. 11, 517–553.

Keyhanian, S., and Nasersharif, B. (2015). “Laplacian eigenmaps latent variable model modification for pattern recognition,” in Procceedings of the 23rd Iranian Conference on Electrical Engineering (ICEE), Tehran.

Koboldt, D. C., Fulton, R. S., McLellan, M. D., Schmidt, H., Kalicki-Veizer, J., McMichael, J. F., et al. (2012). Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70. doi: 10.1038/nature11412

Kong, X. Z., Liu, J. X., Zheng, C. H., Hou, M. X., and Wang, J. (2017). Robust and efficient biomolecular clustering of tumor based on p-norm singular value decomposition. IEEE Trans. Nanobiosci. 16, 341–348. doi: 10.1109/tnb.2017.2705983

Lee, M., Shen, H., Huang, J. Z., and Marron, J. S. (2010). Biclustering via sparse singular value decomposition. Biometrics 66, 1087–1095. doi: 10.1111/j.1541-0420.2010.01392.x

Lew, Y., Jones, D. V., Mars, W. M., Evans, D., Byrd, D., and Frazier, M. L. (1992). Expression of elongation factor-1 gamma-related sequence in human pancreatic cancer. Pancreas 7, 144–152. doi: 10.1097/00006676-199203000-00003

Lin, Z. C., Chen, M. M., and Ma, Y. (2010). The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv [Preprint] 9. doi: 10.1016/j.jsb.2012.10.010

Liu, J., Liu, J. X., Gao, Y. L., Kong, X. Z., Wang, X. S., and Wang, D. (2015). A P-Norm robust feature extraction method for identifying differentially expressed genes. PLoS One 10:e0133124. doi: 10.1371/journal.pone.0133124

Liu, J. X., Wang, D., Gao, Y. L., Zheng, C. H., Shang, J. L., Liu, F., et al. (2017). A joint-L2,1-norm-constraint-based semi-supervised feature extraction for RNA-Seq data analysis. Neurocomputing 228, 263–269. doi: 10.1016/j.neucom.2016.09.083

Liu, J. X., Wang, Y. T., Zheng, C. H., Sha, W., Mi, J. X., and Xu, Y. (2013). Robust PCA based method for discovering differentially expressed genes. BMC Bioinformatics 14, Suppl. 8:S3.

Liu, J. X., Xu, Y., Gao, Y. L., Zheng, C. H., Wang, D., and Zhu, Q. (2016). A class-information-based sparse component analysis method to identify differentially expressed genes on RNA-seq data. IEEE/ACM Trans. Comput. Biol. Bioinform. 13, 392–398. doi: 10.1109/tcbb.2015.2440265

Liu, J. X., Xu, Y., Zheng, C. H., Kong, H., and Lai, Z. H. (2015). RPCA-based tumor classification using gene expression data. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 964–970. doi: 10.1109/tcbb.2014.2383375

Liu, Q. Z., Gao, X. H., Chang, W. J., Wang, H. T., Wang, H., Cao, G. W., et al. (2015). Secreted protein acidic and rich in cysteine expression in human colorectal cancer predicts postoperative prognosis. Eur. Rev. Med. Pharmacol. Sci. 19, 1803–1811.

Matassa, D. S., Amoroso, M. R., Agliarulo, I., Maddalena, F., Sisinni, L., Paladino, S., et al. (2013). Translational control in the stress adaptive response of cancer cells: a novel role for the heat shock protein TRAP1. Cell Death Dis. 4:e851. doi: 10.1038/cddis.2013.379

Michifuri, Y., Hirohashi, Y., Torigoe, T., Miyazaki, A., Fujino, J., Tamura, Y., et al. (2013). Small proline-rich protein-1B is overexpressed in human oral squamous cell cancer stem-like cells and is related to their growth through activation of MAP kinase signal. Biochem. Biophys. Res. Commun. 439, 96–102. doi: 10.1016/j.bbrc.2013.08.021

Misawa, K., Kanazawa, T., Misawa, Y., Imai, A., and Mineta, H. (2011). Hypermethylation of collagen α2 (I) gene (COL1A2) is an independent predictor of survival in head and neck cancer. Cancer Biomark. 10, 135–144. doi: 10.3233/cbm-2012-0242

Nie, F. P., Wang, H., Huang, H., and Ding, C. (2013). Joint Schatten p-norm and lp-norm robust matrix completion for missing value recovery. Knowl. Inf. Syst. 42, 525–544. doi: 10.1007/s10115-013-0713-z

Pogue-Geile, K., Geiser, J. R., Shu, M., Miller, C., and Pipas, J. M. (1991). Ribosomal protein genes are overexpressed in colorectal cancer: isolation of a cDNA clone encoding the human S3 ribosomal protein. Mol. Cell. Biol. 11, 3842–3849. doi: 10.1128/mcb.11.8.3842

Qiao, J., Fang, C. Y., Chen, S. X., Wang, X. Q., and Liu, F. (2015). Stroma derived COL6A3 is a potential prognosis marker of colorectal carcinoma revealed by quantitative proteomics. Oncotarget 6, 29929–29946. doi: 10.18632/oncotarget.4966

Qin, L., Lin, Z., She, Y., and Chao, Z. (2013). A comparison of typical lp minimization algorithms. Neurocomputing 119, 413–424. doi: 10.1016/j.neucom.2013.03.017

Ren, F., Tang, R., Xin, Z., Mihiranganee, M. W., Luo, D., Dang, Y., et al. (2015). Overexpression of MMP family members functions as prognostic biomarker for breast cancer patients: a systematic review and meta-analysis. PLoS One 10:e0135544. doi: 10.1371/journal.pone.0135544

Roweis, S. T., and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326. doi: 10.1126/science.290.5500.2323

Shen, H. P., and Huang, J. H. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. J. Multiv. Anal. 99, 1015–1034. doi: 10.1016/j.jmva.2007.06.007

Spielman, D. A. (2007). “Spectral graph theory and its applications,” in Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS ’07), Providence, RI.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x

Wan, D., Gong, Y., Qin, W., Zhang, P., Li, J., Wei, L., et al. (2004). Large-scale cDNA transfection screening for genes related to cancer development and progression. Proc. Natl. Acad. Sci. U.S.A. 101, 15724–15729. doi: 10.1073/pnas.0404089101

Wang, J., Liu, J. X., Kong, X. Z., Yuan, S. S., and Dai, L. Y. (2019a). Laplacian regularized low-rank representation for cancer samples clustering. Comput. Biol. Chem. 78, 504–509. doi: 10.1016/j.compbiolchem.2018.11.003

Wang, J., Liu, J. X., Zheng, C. H., Wang, Y. X., Kong, X. Z., and Wen, C. G. (2019b). A mixed-norm Laplacian regularized low-rank representation method for tumor samples clustering. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 172–182. doi: 10.1109/tcbb.2017.2769647

Wang, Y. X., Gao, Y. L., Liu, J. X., Kong, X. Z., and Li, H. J. (2017). Robust principal component analysis regularized by truncated nuclear norm for identifying differentially expressed genes. IEEE Trans. Nanobiosci. 16, 447–454. doi: 10.1109/tnb.2017.2723439

West, M. (2003). “Bayesian factor regression models in the ‘large p, small n’ paradigm,” in Bayesian Statistics, Vol. 7. Oxford: Oxford University Press, 723–732.

Xiang, S. M., Nie, F. P., Meng, G. F., Pan, C. H., and Zhang, C. S. (2012). Discriminative least squares regression for multiclass classification and feature selection. IEEE Trans. Neural Netw. Learn. Syst. 23, 1738–1754. doi: 10.1109/tnnls.2012.2212721

Yang, S. Z., Hou, C. P., Nie, F. P., and Wu, Y. (2012). Unsupervised maximum margin feature selection via L2,1-norm minimization. Neural Comput. Appl. 21, 1791–1799. doi: 10.1007/s00521-012-0827-3

Yu, C., Pan, K., Xing, D., Liang, G., and Lin, D. (2002). Correlation between a single nucleotide polymorphism in the matrix metalloproteinase-2 promoter and risk of lung cancer. Cancer Res. 62, 6430–6433.

Zhang, Z., Xu, Y., Yang, J., Li, X., and Zhang, D. (2015). A survey of sparse representation: algorithms and applications. IEEE Access 3, 490–530. doi: 10.1109/access.2015.2430359

Zhang, Z., and Zha, H. (2002). Principal manifolds and nonlinear dimension reduction via local tangent space alignment. J. Shanghai Univ. 8, 406–424. doi: 10.1007/s11741-004-0051-1

Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Stat. 15, 265–286.

Keywords: Lp-norm, graph regularization, sparse constraint, principal component analysis, tumor clustering, gene network modules, L2,1-norm

Citation: Kong X-Z, Song Y, Liu J-X, Zheng C-H, Yuan S-S, Wang J and Dai L-Y (2021) Joint Lp-Norm and L2,1-Norm Constrained Graph Laplacian PCA for Robust Tumor Sample Clustering and Gene Network Module Discovery. Front. Genet. 12:621317. doi: 10.3389/fgene.2021.621317

Received: 26 October 2020; Accepted: 29 January 2021;
Published: 23 February 2021.

Edited by:

Xianwen Ren, Peking University, China

Reviewed by:

Xiaojian Shao, National Research Council Canada (NRC-CNRC), Canada
Yongcui Wang, Northwest Institute of Plateau Biology (CAS), China

Copyright © 2021 Kong, Song, Liu, Zheng, Yuan, Wang and Dai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jin-Xing Liu, sdcavell@126.com; Chun-Hou Zheng, zhengch99@126.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.