Introduction

Noncoding RNAs regulate the majority of biological processes associated with development, differentiation, and metabolism in organisms1. In contrast to small noncoding RNAs (e.g., miRNAs), which are highly conserved and regulate transcriptional and posttranscriptional gene silencing2,3, long noncoding RNAs (lncRNAs), one class of transcribed RNA molecules, are poorly conserved and control gene expression through diverse mechanisms4,5,6. lncRNAs are closely linked to posttranscriptional gene regulation, modulating processes such as protein synthesis, RNA maturation and transport, and transcriptional gene silencing7,8. Although a few lncRNAs have been well studied, the biological functions of the majority remain enigmatic9. Recent studies demonstrate that most lncRNAs regulate various biological activities through specific associations with chromatin, for example, by interacting with corresponding RNA-binding proteins10,11,12. Therefore, identifying potential lncRNA–protein interactions (LPIs) is vital to understanding the biological functions and mechanisms of lncRNAs.

Many experimental methods have been designed to find new LPIs13,14. However, wet-lab experiments for detecting possible LPIs are costly and time-consuming. Computational methods have therefore been developed as an efficient alternative for LPI prediction. These methods fall into two main categories: network-based methods and machine learning-based methods15,16.

Network-based LPI prediction methods, for example, the random walk with restart-based model17, the linear neighborhood propagation algorithm18, bipartite network projection-based recommendation methods19,20,21, and the HeteSim algorithm22, first compute lncRNA similarity and protein similarity from related biological data, then integrate the similarity matrices into a heterogeneous lncRNA–protein network, and finally design network propagation algorithms to score unknown lncRNA–protein pairs. Network-based methods have successfully uncovered part of the known LPIs; however, they cannot be applied to predict interactions for an orphan lncRNA or protein, that is, one without any known interaction.

Machine learning-based LPI identification methods first extract features of lncRNAs and proteins and then design a machine learning model to compute interaction probabilities for lncRNA–protein pairs. Classical machine learning-based LPI prediction models include matrix factorization-based methods and ensemble learning-based methods. Matrix factorization-based methods cast LPI prediction as a recommendation task and use diverse matrix factorization models to discover unobserved LPIs, for example, gradient boosted regression trees23, graph regularized nonnegative matrix factorization24, and neighborhood regularized logistic matrix factorization25,26. Ensemble learning-based methods exploit ensemble techniques to construct models for identifying new LPIs27,28, for example, a random forest-based ensemble framework29, a sequence feature projection-based ensemble algorithm30, a broad learning system-based stacked ensemble classifier31, and a graph attention-based deep learning model32.

Although computational methods have effectively identified potential linkages between lncRNAs and proteins, most of the above models suffer from the following limitations. First, their performance was evaluated on only one dataset, which may produce prediction bias. Second, the vast majority of models cannot be applied to find possible interacting proteins (or lncRNAs) for lncRNAs (or proteins) without any known interaction information. Third, the prediction performance needs to be further improved. To address these three problems, in this study, known LPI data are first integrated and five different LPI datasets are collected. Second, the features of lncRNAs and proteins are extracted based on four-nucleotide composition and the BioSeq2vec method, respectively. Finally, a deep forest model with a cascade forest structure (LPIDF) is designed to find LPI candidates. We compare the proposed LPIDF method with four classical LPI prediction models under three different cross-validation settings. The results show that LPIDF obtains better average AUCs and the best average AUPRs on the five datasets under the three cross validations. More importantly, case studies demonstrate that most of our predicted lncRNA–protein pairs with higher interaction probabilities are true LPIs, and the remainder need further experimental validation.

Results

We perform a series of experiments to investigate the prediction performance of our proposed LPIDF method.

Evaluation metrics

In this study, precision, recall, accuracy, F1-score, AUC and AUPR are used to evaluate the performance of LPIDF. Precision, recall, accuracy, and F1-score are defined as follows.

$$Precision=\frac{TP}{TP+FP}$$
(1)
$$Recall=\frac{TP}{TP+FN}$$
(2)
$$Accuracy=\frac{TP+TN}{TP+FP+TN+FN}$$
(3)
$$F1\text{-}score=\frac{2\times Precision\times Recall}{Precision+Recall}$$
(4)

where TP, FP, TN, and FN denote the numbers of true positives (correctly predicted LPIs), false positives, true negatives, and false negatives, respectively. AUC and AUPR denote the average areas under the receiver operating characteristic (ROC) curve and the precision–recall curve, respectively. The experiments are repeated 20 times and the average performance over the 20 rounds is reported as the final performance.
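A minimal sketch (not the authors' code) of computing the six metrics with scikit-learn, assuming y_true holds the 0/1 labels and y_score holds the predicted interaction probabilities of the test lncRNA–protein pairs:

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, accuracy_score,
                             f1_score, roc_auc_score, average_precision_score)

def evaluate_fold(y_true, y_score, threshold=0.5):
    # Binarize probabilities at 0.5 for the threshold-dependent metrics.
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "Precision": precision_score(y_true, y_pred),
        "Recall":    recall_score(y_true, y_pred),
        "Accuracy":  accuracy_score(y_true, y_pred),
        "F1-score":  f1_score(y_true, y_pred),
        "AUC":       roc_auc_score(y_true, y_score),            # area under the ROC curve
        "AUPR":      average_precision_score(y_true, y_score),  # area under the PR curve
    }
```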

Experimental settings

In this study, we adopt three different experimental settings.

Five-fold Cross Validation 1 (CV1): Cross validation on lncRNAs, that is, random rows (i.e., lncRNAs) in an LPI matrix \(Y\) are masked for testing.

Five-fold Cross Validation 2 (CV2): Cross validation on proteins, that is, random columns (i.e., proteins) in an LPI matrix \(Y\) are masked for testing.

Five-fold Cross Validation 3 (CV3): Cross validation on lncRNA–protein pairs, that is, random lncRNA–protein pairs in an LPI matrix \(Y\) are masked for testing.

Under CV1, in each round, 80% of the lncRNAs in an LPI network \(Y\) are selected as the training set and the remaining 20% are used as the test set. Under CV2, in each round, 80% of the proteins in \(Y\) are selected as the training set and the remaining 20% are used as the test set. Under CV3, in each round, 80% of the lncRNA–protein pairs in \(Y\) are used as the training set and the remaining 20% as the test set. The three cross validations correspond to LPI identification for (1) new lncRNAs (lncRNAs without any known interaction information), (2) new proteins, and (3) lncRNA–protein pairs, respectively.
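A minimal sketch, not the authors' code, of how the three schemes can be realized: CV1 masks whole rows (lncRNAs) of the interaction matrix \(Y\), CV2 masks whole columns (proteins), and CV3 masks individual lncRNA–protein pairs.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_splits(Y, mode="CV3", n_splits=5, seed=0):
    """Yield (train, test) index sets for the chosen cross-validation scheme."""
    n_lnc, n_prot = Y.shape
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    if mode == "CV1":                          # split over lncRNAs (rows of Y)
        for train_rows, test_rows in kf.split(np.arange(n_lnc)):
            yield train_rows, test_rows
    elif mode == "CV2":                        # split over proteins (columns of Y)
        for train_cols, test_cols in kf.split(np.arange(n_prot)):
            yield train_cols, test_cols
    else:                                      # CV3: split over all lncRNA-protein pairs
        pairs = np.array([(i, j) for i in range(n_lnc) for j in range(n_prot)])
        for train_idx, test_idx in kf.split(pairs):
            yield pairs[train_idx], pairs[test_idx]
```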

Comparison with four state-of-the-art methods

We compare our proposed LPIDF method with four state-of-the-art interaction identification methods, XGBoost33,34, Categorical Boosting (CatBoost)35, random forest36,37, and DRPLPI38, to evaluate the prediction ability and robustness of LPIDF. These methods are classical machine learning models that have been widely applied in various areas. XGBoost33,34 is a scalable, end-to-end tree boosting model. CatBoost35 is a gradient boosting technique that effectively combines ordered boosting with the processing of categorical features. Random forest36,37 is composed of multiple decision trees, each trained independently on a random subset of the data. DRPLPI38 exploits a multi-head self-attention model to extract high-quality LPI features based on a long short-term memory encoder–decoder mechanism. In the experiments, we randomly select the same number of negative LPIs as positive LPIs from the unknown lncRNA–protein pairs to reduce the overfitting caused by class imbalance; a minimal sampling sketch is given below.
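A minimal sketch, not the authors' code, of this balanced negative sampling: as many unknown pairs as known LPIs are drawn at random and treated as negative examples.

```python
import numpy as np

def sample_balanced_pairs(Y, seed=0):
    """Return (positive, negative) index arrays with equal sizes."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(Y == 1)          # known LPIs (positive examples)
    unknown = np.argwhere(Y == 0)      # unlabeled lncRNA-protein pairs
    neg = unknown[rng.choice(len(unknown), size=len(pos), replace=False)]
    return pos, neg
```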

In random forest, the number of trees is set to 70 and the minimum number of samples required to split a node is set to 5. In CatBoost, the maximum number of trees is set to 150, the maximum depth to 15, and the learning rate to 0.5. The other parameters are set to the values reported in the corresponding publications. XGBoost is implemented based on the scikit-learn package39.
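A hedged sketch of these baseline configurations; the parameter names follow the scikit-learn, CatBoost, and XGBoost APIs, and any value not stated in the text is left at its library default.

```python
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

# Random forest: 70 trees, minimum of 5 samples to split a node.
rf = RandomForestClassifier(n_estimators=70, min_samples_split=5)
# CatBoost: at most 150 trees, maximum depth 15, learning rate 0.5.
cat = CatBoostClassifier(iterations=150, depth=15, learning_rate=0.5, verbose=0)
# XGBoost via its scikit-learn-compatible interface with default settings.
xgb = XGBClassifier()
```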

Table 1 shows the precision, recall, accuracy, F1-score, AUC, and AUPR values computed by LPIDF and the other four methods under CV1. As shown in Table 1, LPIDF achieves the highest average precision, accuracy, F1-score, and AUPR over all datasets, remarkably outperforming the other four competing LPI prediction methods. Although the average recall and AUC of LPIDF are slightly lower than those of random forest and DRPLPI, LPIDF obtains the best average AUPR. The average AUPR obtained by LPIDF is 0.9022, which is 0.96%, 2.10%, 0.02%, and 0.63% higher than those of XGBoost, CatBoost, random forest, and DRPLPI, respectively. Compared with AUC, AUPR is a more informative metric in this setting. Therefore, LPIDF can effectively find potential proteins interacting with a new lncRNA.

Table 1 The performance of five LPI prediction methods on CV1.

Table 2 gives the comparison results under CV2. LPIDF achieves the best average precision, recall, accuracy, F1-score, AUC, and AUPR over all datasets. Over all datasets, LPIDF obtains the best average AUC of 0.6937, which is 4.80%, 10.81%, 1.17%, and 0.91% better than those of XGBoost, CatBoost, random forest, and DRPLPI, respectively. More importantly, LPIDF obtains the highest average AUPR of 0.6860, which is 2.17% and 2.65% higher than those of the second-best and third-best methods, respectively. In summary, under CV2, LPIDF remarkably improves LPI prediction performance compared with the other four methods and shows a significant advantage in identifying possible lncRNAs for a new protein.

Table 2 The performance of five LPI prediction methods on CV2.

The prediction results under CV3 are shown in Table 3. LPIDF outperforms the other LPI prediction methods over all datasets in terms of all six measurements. For example, LPIDF achieves the best average AUC of 0.9457, which is 1.72%, 6.39%, 0.87%, and 0.97% better than those of XGBoost, CatBoost, random forest, and DRPLPI, respectively. In addition, LPIDF obtains the best average AUPR of 0.9382, which is 0.88% and 1.20% higher than those of the second-best and third-best methods, respectively. These results indicate that LPIDF can effectively predict potential LPIs.

Table 3 The performance of five LPI prediction methods on CV3.

Case study

After confirming the performance of our proposed LPIDF method, we further identify possible LPIs, in particular predicting interaction information for new lncRNAs and proteins.

Finding possible proteins interacting with new lncRNAs

In this section, we aim to find potential proteins interacting with new lncRNAs. All known interactions of Small Nucleolar RNA Host Gene 3 (SNHG3) and Growth Arrest-Specific transcript 5 (GAS5) are masked, so the two lncRNAs are treated as new lncRNAs. LPIDF is then applied to identify possible proteins interacting with them.

SNHG3 is an RNA gene belonging to the lncRNA class. It may be closely associated with various cancers and other diseases, for example, hepatocellular carcinoma40, non-small-cell lung cancer41, clear cell renal cell carcinoma42, gastric cancer43, hypoxic-ischemic brain damage44, papillary thyroid carcinoma45, ovarian cancer46,47, bladder cancer48, and acute myeloid leukemia49. Table 4 shows the predicted top 5 proteins related to SNHG3 with the highest interaction probabilities on the three human datasets.

Table 4 The predicted top 5 proteins interacting with SNHG3.

The results in Table 4 show that the SNHG3–protein pairs predicted by LPIDF are also highly ranked by the other four methods. We predict that O00425 may interact with SNHG3 (ranked 4th) in dataset 3, and this interaction has been validated in dataset 1. In addition, we observe that Q9NUL5 and Q13148 may interact with SNHG3. Among all 27 candidate proteins, the interaction between Q9NUL5 and SNHG3 is ranked 1st by all five LPI prediction methods. The association between Q13148 and SNHG3 is ranked 5th, 7th, 8th, 5th, and 4th by LPIDF, XGBoost, random forest, CatBoost, and DRPLPI, respectively. These results demonstrate the powerful prediction performance of LPIDF.

GAS5 can prevent glucocorticoid receptors from being activated and thus control the transcriptional activity of their target genes. It is inferred to be a potential tumor suppressor and is closely associated with coronary artery disease50,52,53, cirrhotic livers51, rheumatoid arthritis54, Parkinson’s disease55, and primary glioblastoma56.

Table 5 lists the predicted top 5 proteins interacting with GAS5 with the highest interaction scores on the three human datasets. In dataset 3, although the interactions between GAS5 and Q9NZI8 and Q9Y6M1 are unknown, the two pairs are ranked 5th and 4th by LPIDF, respectively. More importantly, in datasets 1 and 2, Q9NZI8 and Q9Y6M1 show higher interaction probabilities with GAS5, and the two LPIs have been reported. In addition, O00425 is inferred to interact with GAS5 (ranked 2nd) in dataset 3, an interaction that has been validated in dataset 1. These results again suggest that LPIDF can effectively find possible proteins associated with a new lncRNA.

Table 5 The predicted top 5 proteins interacting with GAS5.

Finding potential lncRNAs interacting with new proteins

We continue to uncover lncRNAs interacting with new proteins on the three human datasets. All associated lncRNAs of Q13148 and Q9HCK5 are masked, so the two proteins are treated as new proteins. LPIDF is then used to find their possible associated lncRNAs.

Q13148 is an RNA-binding protein involved in RNA biogenesis and processing and in various neurodegenerative diseases57,58,59,60. It also participates in the formation and regeneration of normal skeletal muscle and plays an important role in maintaining the periodicity of the circadian clock59,60. Its second RNA recognition motif has been reported as a major promoter of aggregation and the resulting toxicity61. Frontotemporal lobar degeneration associated with Q13148 aggregation is characterized by progressive neuronal atrophy in the cerebral cortex62. Table 6 lists the predicted top 5 lncRNAs associated with Q13148 on the three human datasets. From Table 6, we observe that all of the predicted top 5 lncRNAs interacting with Q13148 are known in the three datasets.

Table 6 The predicted top 5 lncRNAs interacting with Q13148.

Table 7 lists the identified top 5 lncRNAs associated with Q9HCK5 on the three human datasets. Q9HCK5 is required for RNA-mediated gene silencing, RNA-directed transcription, and human hepatitis delta virus replication63. Table 7 shows that all of the predicted top 5 LPIs for Q9HCK5 are known in the three datasets. In summary, LPIDF may be appropriate for LPI identification for a new protein.

Table 7 The predicted top 5 lncRNAs interacting with Q9HCK5.

Finding new LPIs based on known LPIs

The numbers of lncRNA–protein pairs with unknown interaction information are 51,686, 71,075, 22,572, 2,867, and 49,435 on the five datasets, respectively. We rank these unknown lncRNA–protein pairs by the interaction probabilities computed by LPIDF and list the predicted top 100 lncRNA–protein pairs. The results are shown in Fig. 1, in which black dotted lines and sky blue solid lines represent unknown and known LPIs predicted by LPIDF, respectively; tan hexagons and light sky blue circles denote lncRNAs whose interactions with proteins are unknown and known, respectively; and yellow diamonds denote proteins.

Figure 1

The predicted top 100 LPIs on the five datasets (a) Dataset 1, (b) Dataset 2, (c) Dataset 3, (d) Dataset 4, (e) Dataset 5.

We observe that some identified lncRNA–protein pairs have high interaction probabilities. For example, the interactions between NONHSAT137627 and P35637, n344749 and Q15717, NONHSAT119864 and Q15717, AthIncRNA18 and Q9LES2, and ZmaLncRNA38 and C4J594 are ranked 33rd, 97th, 85th, 161st, and 215th, respectively. These highly ranked lncRNA–protein pairs need further experimental validation.

The lncRNA FTX (NONHSAT137627) can positively regulate the expression and function of ALG3 in AML cells, especially cell growth and apoptosis related to ADR resistance; FTX could thus probably be used to reduce therapeutic resistance in AML64. P35637 plays a key role in RNA transport, mRNA stability, and synaptic homeostasis in neuronal cells63. This protein has been validated as a target for the treatment of cancers, amyotrophic lateral sclerosis, and Alzheimer’s disease65.

In dataset 2, it is observed that FTX interacts with Q15717, Q9NZI8, and P26599. Q15717 helps increase the stability of leptin mRNA. Q9NZI8 can regulate neurite outgrowth and neuronal cell migration, promote the adhesion and movement of tumor-derived cells, and prevent the formation of infectious HIV-1 particles64. P26599 can bind to the viral internal ribosome entry site and stimulate translation mediated by the picornavirus infection site. P35637 has functions similar to those of Q15717, Q9NZI8, and P26599. Based on the “guilt-by-association” principle, we infer that FTX may interact with P35637.

Fractions of true LPIs among the predicted top N LPIs

In addition, we consider the fractions of true LPIs among the predicted top \(N\) LPIs, with \(N\) set to 10, 30, and 50. The results are shown in Table 8. From Table 8, all of the top 10 LPIs predicted by LPIDF have been labeled as 1 on the five datasets, and the same holds for the predicted top 30 LPIs. For the predicted top 50 LPIs, although only 94% of the LPIs have been labeled as 1 in dataset 1, all of the top 50 LPIs are known on the other four datasets. In summary, LPIDF obtains the best prediction performance in terms of the fractions of true LPIs among the top 10, 30, and 50 predictions; a minimal sketch of this check is given below.
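A minimal sketch, not the authors' code, of the top-\(N\) check: rank all test pairs by predicted probability and report the fraction already labeled as 1.

```python
import numpy as np

def fraction_true_top_n(y_true, y_score, n):
    """Fraction of known LPIs (label 1) among the n highest-scored pairs."""
    order = np.argsort(y_score)[::-1]      # highest probabilities first
    return float(np.mean(np.asarray(y_true)[order[:n]]))
```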

Table 8 The fractions of true LPIs among the top \(N\) interactions under CV3.

Discussion and conclusion

lncRNAs are widely distributed in various organisms and regulate gene expression at the transcriptional and posttranscriptional levels. However, lncRNAs are difficult to crystallize, and only a few of them have been investigated. Since lncRNAs exert important regulatory roles through protein molecules, discovering the proteins that bind to specific lncRNAs is a key step toward identifying lncRNA functions and mechanisms.

In this study, we first integrate five LPI datasets, three from human and two from plants. Second, features of lncRNAs and proteins are extracted from their sequences by four-nucleotide composition and BioSeq2vec, respectively. Finally, a deep forest model with a cascade forest structure, LPIDF, is developed to predict LPI candidates. To evaluate the performance of LPIDF, we compare it with four other LPI prediction models on five datasets under three cross validations. The results suggest that LPIDF remarkably outperforms the four competing LPI identification methods. We further conduct a series of case studies to find possible interacting proteins (or lncRNAs) for new lncRNAs (or proteins) and potential LPIs. The case analyses again demonstrate that LPIDF is a powerful LPI identification method.

LPIDF achieves the best precision, recall, accuracy, F1-score, AUC, and AUPR. We attribute this to the following advantages. First, LPIDF selects high-quality features of lncRNAs and proteins based on four-nucleotide composition and BioSeq2vec, respectively. Second, the deep forest with a cascade forest structure automatically determines the depth of the cascade, thereby reducing the prediction bias produced by parameter tuning. Finally, each layer in the cascade forest receives LPI features from the previous layer and sends its result to the next layer; since all layers are generated automatically, LPIDF does not require many hyperparameters. These experimental results indicate that LPIDF has a powerful ability to uncover new LPIs.

In addition, the running time of the proposed LPIDF model and the other methods is investigated. The details are shown in Table 9. The time required by LPIDF is much lower than that of CatBoost and DRPLPI.

Table 9 The time required for all LPI prediction methods.

However, our work has a few limitations. We only consider LPI prediction on human and plant LPI datasets; other species that are evolutionarily closer to humans than plants should also be investigated. In addition, the predicted LPIs with the highest interaction probabilities still need to be experimentally validated.

In the future, first, we will integrate more biological information, for example, disease symptom information, drug chemical structures, and miRNA–lncRNA interactions. Second, we will evaluate the prediction performance of the proposed model on other species that are evolutionarily closer to humans than plants. Third, CD-Hit66 is a widely used tool for reducing sequence redundancy; to improve the performance of sequence analysis algorithms, we will further remove proteins with high sequence similarity in larger datasets using CD-Hit. Finally, we will conduct experimental validation for the predicted RNA-binding proteins.

Materials and methods

Data preparation

In this study, we integrate five different LPI datasets. Dataset 1 was provided by Li et al.17. Noncoding RNA–protein interaction data were first downloaded from the NPInter 2.0 database67, and lncRNA and protein sequences were extracted from the NONCODE 4.0 database68 and the UniProt database65, respectively, yielding 3,487 LPIs between 938 lncRNAs and 59 proteins. We then remove lncRNAs and proteins whose sequences are unavailable in the UniProt65, NPInter67, and NONCODE68 databases, and finally obtain 3,479 LPIs between 935 lncRNAs and 59 proteins.

Dataset 2 was provided by Zheng et al.22. Noncoding RNA–protein interactions, lncRNA sequences, and protein sequences were downloaded from NPInter 2.067, NONCODE 4.068, and UniProt65, respectively, giving 4,467 LPIs between 1,050 lncRNAs and 84 proteins. As for dataset 1, we further remove the lncRNAs and proteins whose sequences are unavailable in the NONCODE68, UniProt65, and NPInter67 databases and obtain 3,265 LPIs between 885 lncRNAs and 84 proteins.

Dataset 3 was provided by Zhang et al.18. Experimentally validated LPIs between 1,114 lncRNAs and 96 proteins were extracted from the data resources compiled by Ge et al.69. The sequences and expression profiles of lncRNAs in 24 human tissues or cell types were downloaded from the NONCODE 4.0 database68, and protein sequences were obtained from the SUPERFAMILY database70. lncRNAs without sequence or expression information and proteins without sequence information were removed, as were lncRNAs (or proteins) with only one associated protein (or lncRNA). Finally, 4,158 LPIs between 990 lncRNAs and 27 proteins were retained.

Dataset 4 contains the sequences of lncRNAs and proteins of Arabidopsis thaliana from the plant lncRNA database (PlncRNADB)71. The LPI data can be obtained from http://bis.zju.edu.cn/plncRNADB. The dataset contains 948 LPIs between 109 lncRNAs and 35 proteins.

Dataset 5 contains the sequences of lncRNAs and proteins of Zea mays from the PlncRNADB database71. The LPI data can be downloaded from http://bis.zju.edu.cn/plncRNADB. The dataset contains 22,133 LPIs between 1,704 lncRNAs and 42 proteins. Table 10 summarizes the five datasets.

Table 10 The details of LPI data.

We describe an LPI network as a matrix \(Y\):

$$Y_{ij}=\begin{cases}1 & \text{if lncRNA } l_i \text{ interacts with protein } p_j\\ 0 & \text{otherwise}\end{cases}$$
(5)
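A minimal sketch, assuming a list of (lncRNA, protein) interaction pairs, of how the binary matrix \(Y\) in Eq. (5) can be assembled.

```python
import numpy as np

def build_lpi_matrix(pairs, lncRNAs, proteins):
    """Build the binary interaction matrix Y from known LPI pairs."""
    row = {name: i for i, name in enumerate(lncRNAs)}
    col = {name: j for j, name in enumerate(proteins)}
    Y = np.zeros((len(lncRNAs), len(proteins)), dtype=int)
    for l, p in pairs:                 # Y_ij = 1 for every known LPI
        Y[row[l], col[p]] = 1
    return Y
```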

Feature selection

Feature selection of lncRNAs

Tri-nucleotide composition has been effectively applied to characterize lncRNA sequences72. In this section, we use four-nucleotide composition to extract lncRNA features. Given an lncRNA sequence \(L\) of length \(x\), where \(l_i \in \{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\}\) and \(i=1,2,\ldots,x\), we count every four-letter arrangement, for example, (A, A, A, A), (A, A, A, C), (A, A, A, G), …, (T, T, T, T), to compute the numeric representation of \(L\).
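A minimal sketch, not the authors' code, of four-nucleotide (4-mer) composition: the normalized frequency of each of the \(4^4=256\) possible 4-mers in an lncRNA sequence.

```python
from itertools import product
import numpy as np

def four_mer_composition(seq):
    """Return the 256-dimensional normalized 4-mer frequency vector of seq."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    index = {k: i for i, k in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - 3):
        k = seq[i:i + 4]
        if k in index:                 # skip windows containing ambiguous bases
            counts[index[k]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts
```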

Feature selection of proteins

The encoder–decoder structure can effectively capture sequence-to-sequence features73,74. Inspired by the sequence representation techniques of Sutskever et al.74 and Yi et al.75, we use the Biological Sequence-to-vector (BioSeq2vec) representation learning method75 with an encoder–decoder structure to characterize the amino acid sequence of a protein.

For a protein with sequence length \(L\), first, a sliding window of size \(K\) is used to divide the sequence into \(L-K+1\) segments. Second, the segments are converted into hash values. Finally, the hash values are used as the input of an autoencoder. As shown in Fig. 2, an input vector composed of the hash values is first mapped into a low-dimensional feature vector by an encoder; the decoder then reconstructs the input vector from this low-dimensional representation. The low-dimensional feature vector in the final intermediate layer is used as the feature representation of the protein.
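A simplified sketch of this hash-then-compress idea, not BioSeq2vec itself (which uses an LSTM-based encoder–decoder): length-\(K\) segments are hashed, and a small dense autoencoder compresses the resulting vector, with the bottleneck activations serving as the protein feature. The values of K, max_len, and the bottleneck size are illustrative choices only.

```python
import torch
import torch.nn as nn

def hash_segments(seq, K=3, max_len=512):
    # Hash each length-K segment to [0, 1); pad or truncate to a fixed length.
    vals = [hash(seq[i:i + K]) % 1000 / 1000.0 for i in range(len(seq) - K + 1)]
    vals = vals[:max_len] + [0.0] * max(0, max_len - len(vals))
    return torch.tensor(vals, dtype=torch.float32)

class AutoEncoder(nn.Module):
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(),
                                     nn.Linear(128, dim))

    def forward(self, x):
        z = self.encoder(x)            # low-dimensional protein feature
        return self.decoder(z), z

# Training with reconstruction loss; the bottleneck z is kept as the feature.
model, loss_fn = AutoEncoder(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = hash_segments("MKT").unsqueeze(0)  # "MKT" is a placeholder for a real protein sequence
for _ in range(100):
    recon, z = model(x)
    opt.zero_grad(); loss_fn(recon, x).backward(); opt.step()
```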

Figure 2

Protein feature selection based on the encoder–decoder.

Deep forest with cascade forest structure

In this study, we utilize a deep forest with a cascade forest structure (LPIDF) to find new LPIs. The model integrates deep forest and ensemble learning into an ensemble-of-ensembles architecture: the deep forest76 performs layer-by-layer propagation and feature transformation, and the ensemble of multiple base classifiers improves LPI prediction more effectively than a single classifier77. In ensemble learning, greater diversity among the base classifiers generally yields larger improvement. To ensure this diversity, four different types of classifiers, logistic regression, the XGBoost classifier, random forest, and extra trees, are used to learn the model.

In the model, class vectors denoting the class distribution are produced by the four base classifiers. For a given LPI feature vector, the class distribution consists of the proportions with which the classifier assigns an lncRNA–protein pair to the two classes (positive and negative). Suppose a random forest contains three trees. As shown in Fig. 3, for an LPI feature vector \({f}_{i}\), the probabilities that the three trees assign the corresponding lncRNA–protein pair to the two classes are \({(0.3750, 0.6250)}^{T}\), \({(0.5556, 0.4444)}^{T}\), and \({(1.0000, 0.0000)}^{T}\), respectively. These probabilities are averaged to obtain the final class distribution \({(0.6435, 0.3565)}^{T}\) for \({f}_{i}\): the probability of the positive class is \((0.3750+0.5556+1.0000)/3=0.6435\) and that of the negative class is \((0.6250+0.4444+0.0000)/3=0.3565\).
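The worked example above as a minimal numeric sketch (using the three per-tree distributions given in the text):

```python
import numpy as np

# Per-tree class distributions (positive, negative) for feature f_i.
tree_probs = np.array([[0.3750, 0.6250],
                       [0.5556, 0.4444],
                       [1.0000, 0.0000]])
class_vector = tree_probs.mean(axis=0)   # approximately [0.6435, 0.3565]
```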

Figure 3

Computing the probability that a feature is classified as positive (or negative) sample.

Similarly, at each layer, logistic regression, the XGBoost classifier, random forest, and extra trees are trained on each LPI feature vector. An eight-dimensional class vector is generated from the two classes and the four types of classifiers.

Figure 4 shows the deep forest with a cascade structure. As illustrated in Fig. 4, an 800-dimensional feature vector is used as the initial input to the cascade forest. After each layer, the generated eight-dimensional class vector, which carries the most discriminative information, is concatenated with the original 800-dimensional features and used as the input to the next layer. The details are as follows. First, four different types of classifiers, logistic regression, the XGBoost classifier, random forest, and extra trees, are trained. Second, an eight-dimensional class vector is produced and concatenated with the original 800-dimensional feature vector to generate an 808-dimensional vector. Third, this 808-dimensional vector is used as the input to the second layer. Similarly, the second layer produces an eight-dimensional class vector, which is again concatenated with the 800-dimensional feature vector, and the resulting 808-dimensional vector is used as the input to the third layer. When training a new layer, a training set is used to tune the parameters and a validation set is used to evaluate the performance. Feature importance is evaluated through the prediction difference between the original LPI features and the learned ones in the four classifiers. Training terminates when the performance is no longer significantly improved. After training, LPI features with zero importance values are removed and the features with valid importance values are kept. A test example (an LPI feature vector) is propagated through each level until the last level. A simplified sketch of the cascade is given below.
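A hedged sketch, not the authors' implementation, of one pass through such a cascade: each layer trains the four base classifiers, appends their class probabilities (an eight-dimensional class vector) to the original features, and growth stops when the validation AUC no longer improves. The stopping tolerance and maximum number of layers are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

def make_layer():
    # The four diverse base classifiers used at every cascade layer.
    return [LogisticRegression(max_iter=1000),
            XGBClassifier(),
            RandomForestClassifier(),
            ExtraTreesClassifier()]

def fit_cascade(X_tr, y_tr, X_va, y_va, max_layers=10, tol=1e-4):
    layers, best_auc = [], 0.0
    F_tr, F_va = X_tr, X_va                            # layer-augmented features
    for _ in range(max_layers):
        clfs = [c.fit(F_tr, y_tr) for c in make_layer()]
        P_tr = np.hstack([c.predict_proba(F_tr) for c in clfs])   # 8-dim class vector
        P_va = np.hstack([c.predict_proba(F_va) for c in clfs])
        auc = roc_auc_score(y_va, P_va[:, 1::2].mean(axis=1))     # mean positive-class prob
        if auc <= best_auc + tol:                      # stop when no significant gain
            break
        best_auc, layers = auc, layers + [clfs]
        F_tr = np.hstack([X_tr, P_tr])                 # concatenate with original features
        F_va = np.hstack([X_va, P_va])
    return layers
```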

Figure 4

Deep forest with cascade forest structure.

Figure 5 demonstrates the pipeline of LPIDF. First, five LPI datasets are collected from existing resources. Second, for each lncRNA–protein pair, the lncRNA and protein sequences are encoded by four-nucleotide composition and BioSeq2vec with an encoder–decoder structure, respectively, and concatenated into one vector. Third, the concatenated vector is used as the input to the cascade forest. Finally, the most important features are selected through layer-by-layer propagation and the label of each lncRNA–protein pair is computed.

Figure 5

Flowchart of the LPI prediction framework based on deep forest with cascade forest structure.