Introduction

Long non-coding RNAs (lncRNAs) are non-coding RNAs more than 200 nucleotides in length. In the past, it was thought that lncRNAs had little effect on gene expression1. However, recent studies have shown that lncRNAs are closely related to various human diseases, which has triggered a surge of bioinformatics research on the association between lncRNAs and diseases2. Studies have shown that lncRNAs are involved in diseases through abnormal sequence3 and spatial structure4, abnormal expression level5 and abnormal interaction with binding proteins6, thereby affecting human health in conditions including diabetes7, cardiovascular disease8, and various types of cancer9. With advances in computing, big data technology has gradually matured. Applying artificial intelligence to the study of lncRNA-disease associations can accelerate the discovery of potential associations, improve the accuracy of biological experiments, and reduce both the effort required of bioinformatics researchers and the cost of biological experiments. In medicine, lncRNA-disease associations can help doctors improve the early detection and targeted treatment of some diseases10; in biology, they can help researchers systematically understand the pathogenesis of complex diseases11. Therefore, it is necessary to analyze the existing data with big data technology and to establish a model that predicts lncRNA-disease associations.

At present, lncRNA-disease association prediction models can be roughly divided into two categories. The first category is based on a single type of association data. For example, Chen et al. proposed a lncRNA-disease prediction method (LRLSSP)12 based on Laplacian regularized least squares and spatial projection. First, the above information was integrated with Gaussian kernel similarity to compensate for the lack of disease semantic similarity, and a lncRNA-disease similarity network was reconstructed; then the Laplacian regularized least squares method was used to estimate lncRNA-disease associations and to address the sparsity of lncRNA-disease data. However, this model has some disadvantages, such as requiring a large amount of combined data and relying too heavily on known lncRNA-disease association data. To address these problems, Xie et al. proposed a prediction method for human lncRNA-disease associations (NCPHLDA)13 based on network consistency projection. The model integrates the above information, including a lncRNA cosine similarity network and a disease cosine similarity network. NCPHLDA requires no parameters and has good prediction performance. However, it has some limitations: if few lncRNA-disease associations are known, the prediction results will be biased. To address the shortage of lncRNA-disease association data, Zhang et al. constructed a lncRNA-disease association prediction model based on integrated space projection scores (LDAI-ISPS)14. In addition, Li et al. proposed a network consistency projection model for lncRNA-disease association prediction (NCPLDA)15. The probability matrix of lncRNA-disease association is calculated by integrating the above information. Then the lncRNA similarity and disease similarity are obtained based on Gaussian kernel similarity. 
Finally, the lncRNA-disease association score is obtained by combining the disease space projection score and the lncRNA space projection score. The above methods have achieved good prediction results, but a disadvantage is that this method depends on the quality of the data. Zeng et al. proposed a hybrid computing framework (SDLDA)16, a lncRNA-disease association prediction model based on singular value decomposition and deep learning. The model uses singular value decomposition and deep learning to extract the linear and nonlinear features of lncRNA-disease pairs, respectively, and combines them to train SDLDA. The linear and nonlinear features complement each other to yield relatively high-quality features, and the concatenated vectors are used for lncRNA-disease association prediction, which greatly improves prediction performance. The disadvantage is that the parameters of SDLDA are difficult to determine. However, biological associations are generally affected by a variety of factors17, so prediction from a single type of data has certain limitations. The second category uses multiple types of association data for prediction. Ding et al. proposed a lncRNA-disease association prediction method (TPGLDA)18. By integrating gene-disease associations and lncRNA-disease associations, it better describes the heterogeneity of coding and non-coding gene-disease associations and effectively identifies potential lncRNA-disease associations. Fu et al. proposed matrix factorization-based data fusion for the prediction of lncRNA-disease associations (MFLDA)19. The method assigns weights to the data sources, so that less relevant sources receive lower weights, and uses the weighted sources to uncover potential lncRNA-disease associations. 
The biggest advantage of this model is that it makes it easy to predict associations between different types of entities by organizing a variety of heterogeneous data sources. However, MFLDA tends to work with sparse data matrices, and its performance can be compromised by low-quality and irrelevant intra-relational data sources. Considering the differing relevance of the inter-relational matrix and the multiple intra-relational matrices, Wang et al. improved MFLDA and proposed WMFLDA20, a weighted matrix factorization model for multi-relational data. First, the model constructs a heterogeneous network for different types of entities and multiple intra-relational networks for entities of the same type. Weights are then assigned to these networks, and a collaborative low-rank matrix factorization is performed. The lncRNA-disease associations are then predicted from the optimized low-rank matrices. WMFLDA can be applied to many kinds of link prediction problems and can integrate both inter-relational and intra-relational data sources. However, this model ignores the different relevance of the multiple relational matrices to the target prediction task. In addition, Liu et al. proposed a weighted graph regularized collaborative matrix factorization method for predicting novel lncRNA-disease associations (WGRCMF)21. When the known information is insufficient, the performance of matrix factorization methods decreases significantly. Xuan et al. developed a probabilistic matrix factorization method for identifying lncRNA-disease associations (PMFLDA)22, which establishes a weighted lncRNA-disease association network from three association networks: lncRNA-miRNA, miRNA-disease and lncRNA-disease. The network is then further updated by a KNN algorithm based on disease semantic similarity and lncRNA functional similarity. Finally, potential lncRNA-disease associations are inferred by probabilistic matrix factorization. 
However, this model relies not only on lncRNA-miRNA and miRNA-disease association data but also on lncRNA-disease association data. The above methods use multi-source data to predict lncRNA-disease associations, but they still require known lncRNA-disease associations, and such data are too sparse. To solve these problems, this paper proposes a new lncRNA-disease association prediction method, LFMP. lncRNA-miRNA association data and miRNA-disease association data are used to calculate lncRNA similarity and disease similarity, and the potential lncRNA-disease associations are constructed from these two data sets. Unknown lncRNA-disease associations are thus predicted without using known lncRNA-disease association data. The experimental results show that the AUC of LFMP reaches 0.8964 under the LOOCV framework, which is better than that of the compared recent models. Case studies of lung and colorectal tumors show that LFMP can effectively infer undetected lncRNA-disease associations.

Results

Evaluation metrics

To evaluate the performance of the LFMP model, we used the ROC curve and the AUC value generated by leave-one-out cross validation (LOOCV) as evaluation measures, and compared LFMP with other advanced models, namely CFNBC23 and NBCLDA24. Under the LOOCV framework, each known lncRNA-disease association is taken in turn as the test set. By comparing the calculated scores with a given threshold, we obtain four counts: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The true positive rate (TPR) and false positive rate (FPR) are calculated by the following formulas:

$$TPR = \frac{TP}{{TP + FN}},$$
(1)
$$FPR = \frac{FP}{{FP + TN}}.$$
(2)

AUC is an index that measures the performance of the model. When AUC = 1, the model is perfect; when AUC \(\in\) [0.85, 0.95], the model is excellent; when AUC \(\in\) [0.7, 0.85], the performance of the model is moderate.
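
As an illustration of Eqs. (1) and (2), the following minimal Python sketch sweeps a threshold over a list of prediction scores, accumulates TPR and FPR, and integrates the resulting ROC curve with the trapezoidal rule. The function names `roc_points` and `auc`, the toy scores and labels, and the assumption that no two scores are tied are ours for illustration and are not part of the original experiments.

```python
import numpy as np

def roc_points(scores, labels):
    """Return FPR and TPR arrays obtained by lowering the threshold score by score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")    # highest score first
    labels = labels[order]
    tp = np.cumsum(labels)                        # TP when the top-k pairs are called positive
    fp = np.cumsum(1 - labels)                    # FP at the same cut
    tpr = np.concatenate(([0.0], tp / labels.sum()))          # TP / (TP + FN), Eq. (1)
    fpr = np.concatenate(([0.0], fp / (1 - labels).sum()))    # FP / (FP + TN), Eq. (2)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy data (hypothetical): 1 marks a known association, 0 an unknown pair.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
fpr, tpr = roc_points(scores, labels)
print(auc(fpr, tpr))
```

In this toy example five of the six positive-negative pairs are ranked correctly, so the sketch yields an AUC of 5/6.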

Comparison with other methods

As shown in Fig. 1, based on a data set of 190 known lncRNA-disease associations, the AUC values of LFMP under the LOOCV framework and the fivefold cross-validation framework are 0.9085 and 0.9072, respectively. Considering that the biological information used by CFNBC and NBCLDA is the same as that of LFMP, we compare LFMP with these prediction models under the LOOCV framework. The ROC and AUPR comparison based on LOOCV is shown in Fig. 2. The ROC comparison and the AUC values show that the CFNBC model reaches 0.8576, the NBCLDA model 0.8521, and the LFMP model 0.8964. LFMP therefore outperforms the other models in terms of ROC curve and AUC value.

Figure 1

The performance of LFMP in terms of ROC curves and AUC based on 190 known lncRNA-disease associations under the LOOCV framework (left) and the fivefold cross-validation framework (right).

Figure 2

ROC and AUPR comparison between the LFMP model and other advanced models under the LOOCV framework, based on 407 known lncRNA-disease associations.

Analysis of parameters

In this model, we introduce the parameter ω, whose value range is [0,1]. This parameter adjusts the ratio of the lncRNA projection score to the disease projection score in the calculation of the final result. We varied the parameter from 0 to 1 in increments of 0.1, and the results are shown in Fig. 3. When \(\omega\) = 0, only lncRNA-miRNA is used to calculate functional similarity and the AUC is 0.8892; when \(\omega\) = 1, only lncRNA-disease is used to calculate functional similarity and the AUC is 0.8693; when \(\omega\) = 0.3, the fused lncRNA similarity matrix is used and the AUC is 0.8964, which shows that the fused functional similarity has certain advantages.

Figure 3

ROC curves calculated by lncRNA projection, disease projection and proportional fusion, together with the variation curve of the parameter ω over the range [0,1].

Case studies

To further demonstrate the ability of LFMP to detect potential disease-associated lncRNAs, several common diseases were analyzed, and the candidate lncRNAs for each disease were ranked by their predicted scores. We verified the top 15 lncRNAs by searching the literature, selected the verified lncRNAs and attached the PMID (the identifier of life science and medical literature indexed by the PubMed search engine) of the relevant supporting literature, as shown in Table 1. Lung cancer (LC) ranks among the top three cancers worldwide by incidence and is the leading cause of cancer death in Germany, and the incidence rates in men and women are 25% and 15%, respectively25. The traditional treatment is surgical resection, but not all patients can be treated surgically, so the survival rate of patients with lung cancer is very low, about 19%26. With the development of bioinformatics, lncRNAs, miRNAs and other genes have been found to be closely linked with various diseases, and various new diagnostic methods and non-surgical treatments for lung cancer have emerged, bringing hope of a cure to the majority of lung cancer patients27,28. Among the top 15 candidate lncRNAs in our prediction results, 7 have been shown to be associated with lung neoplasms. For example, lncRNA XIST promotes the proliferation and migration of non-small cell lung cancer cells by sponging miR-16 and regulating CDK8 expression29; lncRNA KCNQ1OT1 promotes the progression of non-small cell lung cancer by regulating the miR-204-5p/ATG3 axis30; and lncRNA NEAT1 interacts with DNMT1 to regulate the malignant phenotype of cancer cells and cytotoxic T cell infiltration via epigenetic inhibition of p53, cGAS and STING in lung cancer31. lncRNA OIP5-AS1 is strongly expressed in lung cancer tissues, and its expression is correlated with tumor size and tumor growth rate; overexpression of OIP5-AS1 increased the proliferation of lung cancer cells in vitro32.

Table 1 Top 15 candidate lncRNAs for the case-study diseases and related literature.

Colorectal cancer (CRC) is also among the top three cancers in the world; it is the third most common cancer in men (746,000 cases, 10.0% of the total) and the second most common cancer in women (614,000 cases, 9.2% of the total)33. Among the top 15 candidate lncRNAs in our prediction results, 9 have been shown to be associated with colorectal neoplasms. For example, a MALAT1 polymorphism inhibits the binding of miR-194-5p, contributing to the risk, growth and metastasis of colorectal cancer34; the long non-coding RNA HCG18 promotes the growth and invasion of colorectal cancer cells by sponging miR-1271 and upregulating MTDH/Wnt/β-catenin35; lncRNA MALAT1 promotes colorectal cancer malignancy by increasing lncRNA DCP1A expression and miR203 downregulation36; and the long non-coding RNA SNHG16 activates USP22 expression to promote colorectal cancer progression by sponging miR-132-3p37.

Discussion

Computational models for predicting lncRNA-disease associations have been a research hot spot. Using computational models to predict the association between lncRNA and diseases can accelerate the discovery of potential associations, improve the accuracy of biological experiments, reduce the effort of bioinformatics researchers and the cost of biological experiments, and help doctors improve the early detection and targeted treatment of some diseases. At present, there are a large number of lncRNA-disease prediction models. Most of them use known lncRNA-disease association information to predict unknown associations, and the most important step is the calculation of lncRNA-lncRNA similarity and disease-disease similarity. These similarities are commonly calculated from lncRNA-disease association information. This approach has both advantages and disadvantages. The advantage is that similarities calculated directly from lncRNA-disease association information are more credible for predicting lncRNA-disease associations. The disadvantage is that the known lncRNA-disease association information is too sparse, and this lack of known information reduces credibility. Therefore, we use lncRNA-miRNA association information to calculate lncRNA-lncRNA similarity and miRNA-disease association information to calculate disease-disease similarity. Introducing miRNA as an intermediate variable reduces the credibility of the calculated lncRNA-lncRNA and disease-disease similarities for predicting lncRNA-disease associations. However, because the known lncRNA-miRNA and miRNA-disease association information is more complete, the credibility of the calculated similarities is improved. Moreover, the introduction of miRNA addresses the lack of lncRNA-disease association information and greatly helps the prediction of unknown lncRNA-disease associations.

Conclusion

In this study, we propose LFMP, a lncRNA-disease association prediction model based on the latent factor model and projection. The model integrates multiple data sources, namely lncRNA-miRNA association data and miRNA-disease association data, and predicts lncRNA-disease associations indirectly; that is, the model does not need known lncRNA-disease association data to predict the association between lncRNA and disease. Comparison with other models and verification of the prediction results against the literature show that LFMP is reliable and has good prediction ability. Our model also has some limitations. Using multi-source data is a double-edged sword: it helps to improve the reliability of prediction, but also increases the difficulty of obtaining data. Compared with prediction from a single type of association data, this model needs more stringent data preprocessing, and it relies heavily on the known lncRNA-miRNA association data and miRNA-disease association data. If these two data sets are too sparse, the prediction performance of the model will be affected.

Methods

Dataset and preprocessing

The known lncRNA-disease association data set was downloaded from the MNDRv2.0 database (2017 edition) (Supplementary File 3)38, the known miRNA-disease association data set from the HMDD database (2018 edition) (Supplementary File 1)39, and the known lncRNA-miRNA association data set from the Starbase v2.0 database (2015 edition) (Supplementary File 2)40. The data were then cleaned, and the resulting experimental data are listed in Table 2. The lncRNA-miRNA adjacency matrix \(A_{LM} = \{ a^{lm} \}_{m \times n}\) and the miRNA-disease adjacency matrix \(A_{MD} = \{ a^{md} \}_{n \times e}\) are constructed from the lncRNA-miRNA association data set and the miRNA-disease association data set, respectively, where m, n and e are the numbers of lncRNAs, miRNAs and diseases. The construction of the adjacency matrices is shown in Fig. 4.

Figure 4

Flow chart of LFMP applied to lncRNA-disease association prediction.

Table 2 List of experimental data.

Cosine similarity for diseases

The cosine similarity between diseases was calculated from the miRNA-disease adjacency matrix:

$$CD(i,j) = \frac{A_{MD}(:,i) \cdot A_{MD}(:,j)}{||A_{MD}(:,i)|| \, ||A_{MD}(:,j)||},$$
(3)

where \(A_{MD}(:,i)\) is the i-th column vector in the adjacency matrix of miRNA and disease, which represents the association feature of disease i.
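
Eq. (3) can be evaluated for all disease pairs at once. The following NumPy sketch is a minimal illustration under our own assumptions: the function name `cosine_similarity_columns` and the toy matrix are hypothetical, and a disease with no associated miRNA (an all-zero column) is assigned similarity 0, a convention the text does not specify.

```python
import numpy as np

def cosine_similarity_columns(A):
    """Cosine similarity between every pair of columns of A, as in Eq. (3)."""
    A = np.asarray(A, dtype=float)
    dots = A.T @ A                                  # dot products of all column pairs
    norms = np.linalg.norm(A, axis=0)               # 2-norm of each column
    denom = np.outer(norms, norms)
    # Assumption: pairs involving an all-zero column get similarity 0.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Toy miRNA-disease adjacency matrix (4 miRNAs x 3 diseases, hypothetical).
A_MD = np.array([[1, 1, 0],
                 [1, 0, 0],
                 [0, 1, 0],
                 [1, 1, 0]])
CD = cosine_similarity_columns(A_MD)
print(CD)
```

The first two diseases share two of their three miRNAs, so their cosine similarity is 2/3; the third disease has no associations and receives 0 under the stated convention.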

Jaccard similarity for diseases

The calculation of similarity is an important part of gene association prediction. At present, most articles calculate similarity with the Gaussian interaction profile kernel. In contrast to previous work, we use Jaccard similarity. The Jaccard similarity between diseases was calculated from the miRNA-disease adjacency matrix:

$$JD(i,j) = \frac{|A_{MD}(:,i) \cap A_{MD}(:,j)|}{|A_{MD}(:,i) \cup A_{MD}(:,j)|}.$$
(4)

Here \(|A_{MD}(:,i) \cap A_{MD}(:,j)|\) is the number of miRNAs associated with both disease i and disease j, and \(|A_{MD}(:,i) \cup A_{MD}(:,j)|\) is the number of miRNAs associated with disease i or disease j.
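
The Jaccard similarity of Eq. (4) can be sketched in the same vectorized way. The function name `jaccard_similarity_columns` and the toy matrix are hypothetical, and assigning 0 to a pair of diseases with an empty union is our assumption rather than a rule stated in the text.

```python
import numpy as np

def jaccard_similarity_columns(A):
    """Jaccard similarity between every pair of columns of a binary matrix, as in Eq. (4)."""
    B = (np.asarray(A) != 0).astype(int)            # binary association indicator
    inter = B.T @ B                                 # number of miRNAs shared by each disease pair
    counts = B.sum(axis=0)                          # number of miRNAs per disease
    union = counts[:, None] + counts[None, :] - inter
    out = np.zeros(inter.shape, dtype=float)
    # Assumption: pairs with an empty union get similarity 0.
    return np.divide(inter, union, out=out, where=union > 0)

# Toy miRNA-disease adjacency matrix (4 miRNAs x 3 diseases, hypothetical).
A_MD = np.array([[1, 1, 0],
                 [1, 0, 0],
                 [0, 1, 0],
                 [1, 1, 0]])
JD = jaccard_similarity_columns(A_MD)
print(JD)
```

For the first two diseases the intersection contains 2 miRNAs and the union 4, giving a Jaccard similarity of 0.5.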

Integrated disease semantic similarity matrix

The integrated disease similarity IDS combines the cosine similarity CD and the Jaccard similarity JD for diseases:

$$IDS(i,j) = \left\{ \begin{array}{ll} \frac{CD(i,j) + JD(i,j)}{2}, & \text{if } CD(i,j) \ne 0; \\ JD(i,j), & \text{if } CD(i,j) = 0. \end{array} \right.$$
(5)
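
The piecewise rule of Eq. (5) maps directly onto an element-wise selection. The sketch below is illustrative only; the function name `integrate_similarity` and the small input matrices are hypothetical. The same function applies unchanged to the lncRNA matrices CL and JL.

```python
import numpy as np

def integrate_similarity(C, J):
    """Average cosine and Jaccard similarity where the cosine is non-zero, else use Jaccard."""
    C = np.asarray(C, dtype=float)
    J = np.asarray(J, dtype=float)
    return np.where(C != 0, (C + J) / 2.0, J)

# Toy cosine (CD) and Jaccard (JD) matrices for three diseases (hypothetical values).
CD = np.array([[1.0, 2 / 3, 0.0],
               [2 / 3, 1.0, 0.0],
               [0.0, 0.0, 0.0]])
JD = np.array([[1.0, 0.5, 0.0],
               [0.5, 1.0, 0.0],
               [0.0, 0.0, 0.0]])
IDS = integrate_similarity(CD, JD)
print(IDS)
```

For a binary adjacency matrix, a zero cosine similarity means the two entities share no miRNA, so the Jaccard value selected by the second branch is also zero in that case.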

Cosine similarity for lncRNA

The cosine similarity between lncRNAs was calculated from the lncRNA-miRNA adjacency matrix:

$$CL(i,j) = \frac{A_{LM}(i,:) \cdot A_{LM}(j,:)}{||A_{LM}(i,:)|| \, ||A_{LM}(j,:)||}.$$
(6)

Jaccard similarity for lncRNA

The Jaccard similarity between lncRNAs was calculated from the lncRNA-miRNA adjacency matrix:

$$JL(i,j) = \frac{|A_{LM}(i,:) \cap A_{LM}(j,:)|}{|A_{LM}(i,:) \cup A_{LM}(j,:)|}.$$
(7)

Integrated lncRNA similarity matrix

The integrated lncRNA similarity ILS combines the cosine similarity CL and the Jaccard similarity JL for lncRNAs:

$$ILS(i,j) = \left\{ \begin{array}{ll} \frac{CL(i,j) + JL(i,j)}{2}, & \text{if } CL(i,j) \ne 0; \\ JL(i,j), & \text{if } CL(i,j) = 0. \end{array} \right.$$
(8)

Calculation of latent factor model

In contrast to previous studies41,42, the preliminary lncRNA-disease association matrix was calculated from the adjacency matrix \(A_{LM} = \{ a^{lm} \}_{m \times n}\), composed of lncRNA-miRNA association information, and the adjacency matrix \(A_{MD} = \{ a^{md} \}_{n \times e}\), composed of miRNA-disease association information, and was defined as follows:

$$A_{LD} = A_{LM} \times A_{MD} .$$
(9)
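
A small worked example of Eq. (9) may help. With binary adjacency matrices, each entry of the product counts the miRNAs that link a given lncRNA to a given disease. The toy matrices below are hypothetical.

```python
import numpy as np

# Toy lncRNA-miRNA adjacency matrix: 2 lncRNAs x 3 miRNAs (hypothetical).
A_LM = np.array([[1, 0, 1],
                 [0, 1, 0]])
# Toy miRNA-disease adjacency matrix: 3 miRNAs x 2 diseases (hypothetical).
A_MD = np.array([[1, 0],
                 [0, 1],
                 [1, 1]])

# Eq. (9): entry (i, j) is the number of miRNAs shared by lncRNA i and disease j.
A_LD = A_LM @ A_MD
print(A_LD)
```

Here the first lncRNA reaches the first disease through two miRNAs and the second disease through one, while the second lncRNA has no miRNA in common with the first disease, which leaves a zero entry.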

The matrix \(A_{LD} = \{ a^{ld} \}_{m \times e}\) represents the preliminary association score between lncRNA and disease. However, the matrix is still too sparse. To solve this problem, we use the latent factor model to calculate the potential score. The matrix \(A_{LD}\) can be approximated by the product ψ of two matrices X and Y:

$$\psi_{ij} = X_{i}^{T} Y_{j} = \sum\limits_{k = 1}^{K} x_{ik} y_{kj} .$$
(10)

X is the lncRNA feature matrix, Y is the disease feature matrix, and k indexes the K latent (implicit) classes. X and Y are obtained by decomposing \(A_{LD}\); conversely, multiplying the lncRNA feature matrix X by the disease feature matrix Y gives the lncRNA-disease score matrix ψ. Compared with \(A_{LD}\), ψ assigns a score to the entries that are zero in \(A_{LD}\), while for the non-zero entries of \(A_{LD}\) the corresponding entries of ψ are approximately equal to those of \(A_{LD}\). Each element of ψ is the dot product of the corresponding feature vectors in X and Y and reflects the fit between the lncRNA features and the disease features. Therefore, the larger an entry of ψ, the stronger the association between the lncRNA and the disease. To obtain X and Y, we use the gradient descent method, and the loss function is defined as:

$$L(X,Y) = \sum\limits_{(i,j) \in K} \left( \psi_{ij} - X_{i}^{T} Y_{j} \right)^{2} + \lambda \sum\limits_{i} ||X_{i} ||^{2} + \lambda \sum\limits_{j} ||Y_{j} ||^{2} .$$
(11)

Here, \(\lambda \sum_{i} ||X_{i} ||^{2}\) and \(\lambda \sum_{j} ||Y_{j} ||^{2}\) are regularization terms used to prevent overfitting, and λ is determined experimentally. For each \(X_{i}\), the partial derivative is:

$$\frac{\partial L}{{\partial X_{i} }} = \frac{{\partial \left[ {\sum\nolimits_{i,j} {\left( {\psi_{ij} - X_{i}^{T} Y_{j} } \right)}^{2} + \lambda \sum\limits_{i} {||X_{i} ||^{2} } } \right]}}{{\partial X_{i} }} = \sum\limits_{j} 2 \left( {X_{i}^{T} Y_{j} - \psi_{ij} } \right)Y_{j} + 2\lambda X_{i} .$$
(12)

Then, following the stochastic gradient descent method, the parameters are updated along the direction of steepest descent, which gives the following update rule:

$$X_{i} = X_{i} - \alpha \frac{\partial L}{{\partial X_{i} }},$$
(13)

where \(\alpha\) is the learning rate. Combining formula (12) with formula (13) gives:

$$X_{i} = X_{i} - \alpha \left[ \sum\nolimits_{j} 2\left( X_{i}^{T} Y_{j} - \psi_{ij} \right) Y_{j} + 2\lambda X_{i} \right] .$$
(14)

Similarly, we can get:

$$Y_{j} = Y_{j} - \alpha \left[ \sum\nolimits_{i} 2\left( X_{i}^{T} Y_{j} - \psi_{ij} \right) X_{i} + 2\lambda Y_{j} \right] .$$
(15)

In our experiment, α is set to 0.0002 and λ is set to 0.004.
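
The update rules (14) and (15) can be sketched as follows. This is a minimal illustration, not the original implementation: it uses full-batch gradient descent with simultaneous updates of X and Y, takes the preliminary score matrix as the fitting target, and assumes that the index set in Eq. (11) contains the non-zero entries of that matrix. The function names, the number of latent classes `k`, the iteration count and the random initialization are hypothetical; only α = 0.0002 and λ = 0.004 come from the text.

```python
import numpy as np

def lfm_loss(A, X, Y, lam=0.004):
    """Regularized squared error of Eq. (11) over the non-zero entries of A (assumption)."""
    A = np.asarray(A, dtype=float)
    mask = A != 0
    residual = mask * (A - X @ Y)
    return float(np.sum(residual ** 2) + lam * (np.sum(X ** 2) + np.sum(Y ** 2)))

def latent_factor_model(A_LD, k=2, alpha=0.0002, lam=0.004, n_iter=5000, seed=0):
    """Factorize A_LD (m x e) into X (m x k) and Y (k x e) by gradient descent."""
    A = np.asarray(A_LD, dtype=float)
    m, e = A.shape
    rng = np.random.default_rng(seed)
    X = rng.random((m, k))                 # lncRNA features, one row per lncRNA
    Y = rng.random((k, e))                 # disease features, one column per disease
    mask = A != 0                          # assumption: only non-zero entries enter the loss
    for _ in range(n_iter):
        err = mask * (X @ Y - A)           # X_i^T Y_j minus the target score
        grad_X = 2.0 * err @ Y.T + 2.0 * lam * X    # Eq. (12), all rows at once
        grad_Y = 2.0 * X.T @ err + 2.0 * lam * Y    # analogous gradient for Y
        X = X - alpha * grad_X             # Eq. (14)
        Y = Y - alpha * grad_Y             # Eq. (15)
    return X, Y, X @ Y

A_LD = np.array([[2, 1],
                 [0, 1]])
X, Y, psi = latent_factor_model(A_LD)
print(psi)                                 # potential score matrix, including the zero entry
```

Because the zero entry of the toy matrix is excluded from the loss, its value in the returned score matrix is determined entirely by the learned feature vectors, which is how the model assigns scores to unobserved lncRNA-disease pairs.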

Establishment of LFMP prediction model

This paper proposes a new prediction model, LFMP, by combining the latent factor model and projection. The flow chart of the LFMP model is shown in Fig. 4. In contrast to previous studies43, we extend network consistency projection from single lncRNA-disease association data to multi-source data, such as lncRNA-miRNA association data and miRNA-disease association data. The lncRNA-disease potential score matrix is calculated by the latent factor model. On this potential score matrix, the integrated lncRNA similarity and the integrated disease similarity are used to project lncRNA and disease, respectively. The projection of lncRNA is defined as:

$$LP(i,:) = \frac{{\sum\limits_{j = 1}^{m} {(ILS(i,j) \times \psi (j,:))} }}{||ILS(i,:)||}.$$
(16)

In the above formula, \(ILS(i,:)\) is the vector of similarities between lncRNA i and all lncRNAs, \(\psi (j,:)\) is the vector of potential scores between lncRNA j and all diseases, and \(||ILS(i,:)||\) is the 2-norm of the vector formed by row i of the integrated lncRNA similarity matrix. \(LP(i,j)\) is the projection score of lncRNA i on disease j, and m is the number of lncRNAs. The projection of disease is defined as:

$$DP(:,j) = \frac{{\sum\limits_{i = 1}^{e} {(IDS(i,j) \times \psi (:,i)} )}}{||IDS(:,j)||}.$$
(17)

\(IDS(:,j)\) is the vector of similarities between disease j and all diseases, \(\psi (:,i)\) is the vector formed by column i of the lncRNA-disease potential score matrix, and \(||IDS(:,j)||\) is the 2-norm of \(IDS(:,j)\). \(DP(i,j)\) is the projection score, and e is the number of diseases.
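
Eqs. (16) and (17) reduce to two matrix products followed by a normalization. The sketch below is illustrative; the function names and the toy matrices are hypothetical, and it assumes that every row of ILS and every column of IDS has a non-zero norm.

```python
import numpy as np

def lncrna_projection(ILS, psi):
    """Eq. (16): LP(i,:) = sum_j ILS(i,j) * psi(j,:) / ||ILS(i,:)||."""
    ILS = np.asarray(ILS, dtype=float)
    psi = np.asarray(psi, dtype=float)
    row_norms = np.linalg.norm(ILS, axis=1, keepdims=True)    # shape (m, 1)
    return (ILS @ psi) / row_norms

def disease_projection(IDS, psi):
    """Eq. (17): DP(:,j) = sum_i IDS(i,j) * psi(:,i) / ||IDS(:,j)||."""
    IDS = np.asarray(IDS, dtype=float)
    psi = np.asarray(psi, dtype=float)
    col_norms = np.linalg.norm(IDS, axis=0, keepdims=True)    # shape (1, e)
    return (psi @ IDS) / col_norms

# Toy inputs (hypothetical): 2 lncRNAs, 2 diseases.
ILS = np.array([[1.0, 0.5],
                [0.5, 1.0]])
IDS = np.array([[1.0, 0.0],
                [0.0, 1.0]])
psi = np.array([[2.0, 1.0],
                [0.0, 1.0]])
LP = lncrna_projection(ILS, psi)
DP = disease_projection(IDS, psi)
print(LP)
print(DP)
```

With an identity disease similarity matrix, the disease projection leaves the score matrix unchanged, whereas the lncRNA projection spreads part of the first lncRNA's score to the second through their similarity of 0.5.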

The final lncRNA-disease potential association prediction score matrix is formed by fusing the lncRNA projection score with the disease projection score:

$$LFMP(i,j) = \frac{{\omega LP(i,j) + \left( {1 - \omega } \right)\;DP(i,j)}}{||ILS(i,:)|| + ||IDS(:,j)||}.$$
(18)

\(LFMP(i,j)\) is the final association score between lncRNA i and disease j, and ω is the weight that balances the lncRNA projection and the disease projection in the final result.
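
The fusion step of Eq. (18) can be sketched as below. The function name `lfmp_score` and the toy inputs are hypothetical; the default ω = 0.3 is the value reported as best in the parameter analysis.

```python
import numpy as np

def lfmp_score(LP, DP, ILS, IDS, omega=0.3):
    """Eq. (18): weighted fusion of the two projections, normalized by the similarity norms."""
    LP = np.asarray(LP, dtype=float)
    DP = np.asarray(DP, dtype=float)
    row_norm = np.linalg.norm(np.asarray(ILS, dtype=float), axis=1)[:, None]   # ||ILS(i,:)||
    col_norm = np.linalg.norm(np.asarray(IDS, dtype=float), axis=0)[None, :]   # ||IDS(:,j)||
    return (omega * LP + (1.0 - omega) * DP) / (row_norm + col_norm)

# Toy inputs (hypothetical): identity similarity matrices give norms of 1.
LP = np.array([[2.0, 1.0],
               [1.0, 2.0]])
DP = np.ones((2, 2))
ILS = np.eye(2)
IDS = np.eye(2)
scores = lfmp_score(LP, DP, ILS, IDS, omega=0.3)
print(scores)
```

Ranking the entries of one column of the resulting matrix in descending order gives the candidate lncRNAs for the corresponding disease, as used in the case studies.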