RMDGCN: Prediction of RNA methylation and disease associations based on graph convolutional network with attention mechanism

Lian Liu; Yumeng Zhou; Xiujuan Lei

doi:10.1371/journal.pcbi.1011677

Abstract

RNA modification is a post transcriptional modification that occurs in all organisms and plays a crucial role in the stages of RNA life, closely related to many life processes. As one of the newly discovered modifications, N1-methyladenosine (m¹A) plays an important role in gene expression regulation, closely related to the occurrence and development of diseases. However, due to the low abundance of m¹A, verifying the associations between m¹As and diseases through wet experiments requires a great quantity of manpower and resources. In this study, we proposed a computational method for predicting the associations of RNA methylation and disease based on graph convolutional network (RMDGCN) with attention mechanism. We build an adjacency matrix through the collected m¹As and diseases associations, and use positive-unlabeled learning to increase the number of positive samples. By extracting the features of m¹As and diseases, a heterogeneous network is constructed, and a GCN with attention mechanism is adopted to predict the associations between m¹As and diseases. The experimental results indicate that under a 5-fold cross validation, RMDGCN is superior to other methods (AUC = 0.9892 and AUPR = 0.8682). In addition, case studies indicate that RMDGCN can predict the relationships between unknown m¹As and diseases. In summary, RMDGCN is an effective method for predicting the associations between m¹As and diseases.

Author summary

As a new epitranscriptomic modification, m¹A plays an important role in the gene expression regulation, closely related to the occurrence and development of diseases.
However, due to the low abundance of m¹A, verifying the associations between m¹As and diseases through wet experiments requires a great quantity of manpower and resources.
It is especially important to develop computational methods for predicting the associations between m¹A modifications and diseases.
We developed a deep learning model to predict the associations of m¹As and diseases, namely RMDGCN.
RMDGCN increases the number of known relationships between m¹As and diseases through PU learning, and combines m¹A similarity network and disease similarity network to construct heterogeneous networks. It adopts GCN with layered attention mechanism to predict the associations between methylations and diseases.
The results of the 5-fold cross validation show that the performance of RMDGCN is superior to other comparison algorithms.
Through case study analysis of breast cancer, RMDGCN can effectively predict the relationships between unknown m¹As and diseases.

Citation: Liu L, Zhou Y, Lei X (2023) RMDGCN: Prediction of RNA methylation and disease associations based on graph convolutional network with attention mechanism. PLoS Comput Biol 19(12): e1011677. https://doi.org/10.1371/journal.pcbi.1011677

Editor: Mark Alber, University of California Riverside, UNITED STATES

Received: April 21, 2023; Accepted: November 10, 2023; Published: December 6, 2023

Copyright: © 2023 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data and code of RMDGCN is freely available at https://github.com/lianliu09/RMDGCN.

Funding: This work has been supported by National Natural Science Foundation of China [61972451 to XL, 62272288 to XL]; Fundamental Research Funds for the Central Universities [GK202103091 to LL, GK202302006 to XL]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

RNA epigenetic modifications occur in almost all types of RNA and play an important role in all stages of life. Up to now, more than 170 different RNA modifications have been identified. As one of the modifications, N1-methyladenosine (m¹A) is a new epigenetic one, and the nitrogen in the first position of adenine is modified by a methyl group during the process [1]. The University of Chicago analyzed the methylation of m¹A in eukaryotic mRNA by m¹A RNA methylation sequencing and RIP sequencing techniques, and the appearance of the m¹A chemical modification was to significantly enhance protein translation of transcripts. It was further observed that m¹A modification is prevalent in humans, rodents and yeast, which suggested that m¹A modification is evolutionarily conserved [2]. With the development of high-throughput sequencing technologies, several studies have identified methylation sites for m¹A in nuclear and mitochondrial RNA. The results show that unlike the prevalent m⁶A modification, the abundance of m¹A is relatively low and most of the m¹A methylation modification sites are concentrated in the 5’UTR of the mRNA transcript, particularly at the first and second positions of the transcript initiation site. A small proportion of sequences were created by a known methylation enzyme complex, TRMT6 / 61A, conform to the ’GUUCRA’ sequence motif [3]. A large number of m¹A methylation modifications were found in the mitochondrially encoded transcripts by using m¹A mapping techniques. Among them, m¹A within the 5’UTR of mRNA transcripts can promote protein translation, while m¹A observed in the coding region of transcripts can lead to inhibition of translation. In addition, ALKBH1, an RNA repair enzyme, can function as an m¹A scavenger by catalyzing the demethylation reaction [4]. And m¹A can affect ribosome biosynthesis and mediate antibiotic resistance in rRNA [5], and mediate tRNA responses to environmental stress [6,7].

Some researchers have shown that genetic variants function can alter RNA modification by specifically replacing nucleotides at the modification site or altering the nucleotide sequence in the proximal flanking region [8]. More and more scientific studies have shown that m¹A methylation sites are closely associated with human diseases, such as cardiovascular disease [9], cancer [10], etc. Due to the high cost and complexity of wet experiments used to verify the associations between m¹A modification sites and diseases, predicting the associations between m¹A sites and diseases based on computational methods can help predict potential relationships between m¹As and diseases. Therefore, it is necessary to develop effective methods for predicting the associations between m¹A sites and diseases. The associations prediction between m¹A modification sites and diseases is a computational approach to predict the relationships between unknown m¹A modification sites and diseases based on experimentally confirmed m¹A modification sites and diseases associations. So far, there has been no specific method for predicting the associations between m¹A sites and diseases. However, there are some methods for predicting the associations between N7-methylguanosine and diseases. m7GDisAI [11] integrates the location information of m⁷G and the comprehensive similarity information of diseases to construct a heterogeneous network, on which matrix decomposition methods are applied to predict potential disease-related m⁷G sites. HN-CNN [12] constructs heterogeneous networks based on m⁷G site similarity, disease similarity, and the associations between m⁷G and diseases to form features of m⁷G site-disease pairs. Then, multidimensional and uncorrelated features are obtained through convolutional neural networks (CNN) [13]. And XGBoost was used to predict the correlation between m⁷G site and disease. BRPCA [14] can accelerate the presence of noise and redundancy in association and similar information. And by introducing appropriate bounded constraints, it ensures that the predicted correlation score is within a meaningful interval. Numerous experiments have demonstrated the superiority and robustness of BRPCA. m7GDP-RW [15] combines the m⁷G sites and diseases feature information with known m⁷G-disease associations to calculate m⁷G similarity and disease similarity. Then, a heterogeneous network of m⁷G- diseases is constructed by combining the known m⁷G-disease associations with the computational similarity of m⁷G and diseases. Finally, a two times random walk algorithm with restart is used to search for new m⁷G-disease associations on the heterogeneous network.

Although there are few cases to solve the problem of predicting RNA methylation modification sites and diseases associations, various models developed in the past few years have made great progress in predicting potential miRNA-disease associations, lncRNA-disease associations, drug targets, etc. The existing methods can be roughly divided into two categories: network-based methods and machine learning-based algorithms.

The network-based algorithm constructs the associated data into a network and uses the model on the network to solve the problem. Tang et al. [16] proposed a double Laplacian regularization matrix completion model for miRNA-disease associations prediction, which transformed miRNA-disease associations prediction into a matrix complementation problem, using double Laplacian regularization terms to make full use of miRNA functional similarity and disease semantic similarity for miRNA-disease associations matrix complementation. The AUC of GLOOCV and LLOOCV were 0.9174 and 0.8289, respectively. Gu et al. [17] developed a global network random walk model to predict the potential lncRNA-disease associations GrwLDA, which can be used for disease with unknown associated lncRNA (isolated disease) and lncRNA with unknown associated disease (novel lncRNA). Wang et al. [18] proposed a weighted graph regularization collaborative non-negative matrix decomposition model to reconstruct the associations adjacency matrix between drugs and diseases by using weighted k-nearest neighbors, and to identify potential associations between drugs and diseases by using a graph regularization non-negative matrix decomposition model.

The machine learning-based approach treats the associations between histological data and diseases as a triadic set of samples, and models the histological data-disease associations problem as a classification problem on the sample set, which is classified by machine learning methods. Chen et al. [19] proposed a Kronecker regularized least squares method based on multiple kernel learning for miRNA-disease associations prediction, introducing Kronecker regularized least squares, which can reveal potential miRNA-disease associations by automatically optimizing the combination of multiple kernels of diseases and miRNAs. Zhang et al. [20] proposed a similar constraint matrix decomposition method for drug-disease associations prediction SCMFDD, which considers the biological background of the problem and uses known drug-disease associations, the drug features and diseases semantic information to project the drug-disease associations into two low-order spaces that reveal the underlying features of the drugs and diseases. Peng et al. [21] proposed to use neural networks for learning based miRNA-disease associations recognition framework, and to construct a three-layer heterogeneous network based on the disease similarity network, miRNA similarity network and protein interaction network to extract association features, and to use automatic encoder for feature dimension reduction. The features after dimension reduction are input into CNN, after which CNN completes the prediction task.

Traditional convolutional neural networks have brought great improvements in the field of text and image processing, but it can only process Euclidean spatial data. Due to the prevalence of graph data, researchers started to focus on how to construct deep learning models on graphs. Bruna et al. [22] first proposed Graph Convolution Network (GCN) and defined graph convolution in spectral space based on convolution theorem. CRPGCN [23] proposed a GCN constructed based on restart random walk (RWR) and principal component analysis (PCA) to predict the associations between circRNA and diseases. The RWR algorithm is used to improve the similarity associations between nodes and their neighbors and the PCA method performs dimension reduction and feature extraction. The GCN algorithm learns the features between circRNAs and the diseases and calculates the final similarity score. Hou et al. [24] used two GCNs (Asso-GCN and Sim-GCN) to extract features of piRNAs and diseases by obtaining association patterns from piRNA-disease interaction networks and two similar networks. GCN captures complex network structure information from the networks and learns to identify features, using fully connected networks and internal yield as output modules to predict piRNA-disease associations scores. Zhang et al. [25] proposed a lncRNA-disease associations prediction model SGALDA based on a semantic and global dual attention mechanism, which divides the constructed heterogeneous network into multiple sub-networks and applies the GCN on each sub-network separately to extract the semantic features of nodes to capture the higher-order interactions on the heterogeneous network.

In this study, we proposed RMDGCN to predict the associations between m¹As and diseases based on positive-unlabeled learning (PU learning) and GCN. We obtain the similarity matrix of m¹A sites through cosine similarity calculation, and the similarity matrix of diseases through disease symptom similarity and semantic similarity. They form a heterogeneous network with the adjacency matrix by PU learning. Finally, a GCN with attention mechanism is applied to the heterogeneous network to predict potential m¹A modification sites related to diseases. In order to verify the effectiveness of RMDGCN, we used a 5-fold cross validation method to compare this method with other methods. The experimental results show that RMDGCN results are significantly superior to other methods and can well predict the potential m¹A sites associated with disease.

Results and discussion

Evaluation metrics

In order to evaluate the performance of the model, we use 5-fold cross validation (5CV) to verify. In the 5CV, all known associations between m¹As and diseases were randomly divided into 5 parts, each part is considered as a test set, and the remaining four parts are considered as training sets. In the test set, we set the known associations between m¹As and diseases in the matrix to 0, and then obtain the predicted scores by RMDGCN. Then the prediction scores are sorted in descending order. We drew the receiver operating characteristic (ROC) curve and the Precision-Recall (PR) curve, and calculated the area under the ROC curve and the area under the PR curve (AUPR) to evaluate the model performance. The ROC curve is obtained by TPR and FPR under different scoring thresholds, and the PR curve is obtained by precision and recall under different scoring thresholds. TPR, FPR, precision and recall are calculated by Eq 1. The AUC is not sensitive to whether the sample category is balanced. The performance under the condition of highly unbalanced data is still too ideal to show the actual situation well. Under extremely unbalanced data (fewer positive samples), PR curve may be more practical than ROC curve. We use AUC and AUPR as the main evaluation indicators.

In addition, we adopt Precision, Recall, Accuracy (ACC) and F1_score to show the results of the model, which are defined as follows respectively: (1) where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.

Adjustment of parameters

In RMDGCN, many parameters need to be adjusted, including PU learning threshold, similarity parameter λ, penalty factor α and β, and the embedding dimension. The purpose of PU learning is to obtain more positive samples, but PU learning only obtains predictive scores for the relationships between m¹A modification sites and diseases. In order to obtain the corresponding adjacency matrix, it is necessary to obtain a 0–1 relationship matrix based on the set threshold. In general, we set the classification threshold to 0.5. In order to obtain higher confidence results, we set the threshold = [0.5,0.6,0.7,0.8,0.9]. Then, we compared the results of learning with PU learning and without PU learning, and the results of PU learning were achieved through different thresholds. When not performing PU learning, only the standard adjacency matrix A obtained from the beginning is used. When performing PU learning, a new adjacency matrix is obtained based on the set threshold. If the score of PU learning is greater than the threshold, it is 1, otherwise it is 0. From Fig 1, it can be seen that the results of learning with PU learning are significantly better than those without PU learning. When threshold = 0.6, all results are optimal except for Recall. Therefore, we set the threshold to 0.6. When threshold = 0.6, the positive samples (i.e. = 1) in the adjacency matrix are 11034.

Download:

Fig 1. The results of the different thresholds.

https://doi.org/10.1371/journal.pcbi.1011677.g001

Then we discussed the disease feature fusion parameter λ, and set λ = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]. When λ = 0, disease similarity features are only composed of semantic similarity, and when λ = 1, disease similarity features are composed of symptom similarity. From Fig 2, it can be seen that when λ = 0.9, AUC and AUPR can achieve the maximum value, and the performance of the whole model is optimal. This shows that the symptom characteristics of diseases play a more important role in building heterogeneous networks using disease similarity features.

Download:

Fig 2. The AUCs and AUPRs between different λ.

https://doi.org/10.1371/journal.pcbi.1011677.g002

Penalty factors α and β are mainly used to control the contribution of similarity in GCN propagation. We let α and β both as [0.5,1,2,3,4]. Because the AUCs are not very different, we only show the AUPRs. As a result, RMDGCN (α = 0.5 and β = 0.5) gained the highest AUPR of 0.8682 in 5CV as shown in Fig 3.

Download:

Fig 3. The AUPRs results of α and β.

https://doi.org/10.1371/journal.pcbi.1011677.g003

Finally, we analyzed the impact of embedding dimensions on the experimental results. We set the embedding dimension as [8,16,32,64,128], and the experimental results (see Fig 4) show that the optimal experimental results can be obtained when the embedding dimension is 16.

Download:

Fig 4. The results of embedding dimension.

https://doi.org/10.1371/journal.pcbi.1011677.g004

Comparing with other methods

In order to analyze the performance of RMDGCN in predicting the relationships between m¹A modification sites and diseases, we compared it with the commonly used methods of relationship prediction, RWR and non-negative matrix factorization (NMF), under the five-fold cross-validation. The RWR and NMF used here are basic models without any improvement. In addition, we also compared RMDGCN with existing m⁷G-disease associations prediction method BRPCA [14], drug and disease prediction methods SCMDDF [20], and miRNA-disease associations prediction method LOMDA [26]. RWR can use the multi-faceted information of nodes to obtain the correlation score between nodes by capturing the global information of the graph. Since the same molecule may cause many different diseases, we believe that the potential factors controlling the molecular disease associations are highly correlated. NMF decomposes the association matrix into base matrix and weight matrix. The base matrix describes the relative contribution of potential regulatory factors, and the weight matrix describes the relative contribution of diseases. The predicted score of the relationships between the modified sites and the diseases is obtained by multiplying the two matrices. BRPCA performs singular value decomposition on heterogeneous networks to recover missing items in the adjacency matrix of heterogeneous networks. SCMDDF is a method of using similarity constraint matrix factorization to predict the relationships between drugs and diseases, while LOMDA is a linear prediction method based on linear optimization methods for predicting the associations between miRNA and diseases.

The ROC curves and PR curves of the several methods are shown in the Fig 5. It can be seen from the figure that RMDGCN performs best in predicting the relationships between m¹A modification sites and diseases, which is 0.053 higher than the AUC of BRPCA, 0.0904 higher than SCMFDD, and 0.0955 higher than LOMDA. In addition, AUPR is 0.704 higher than BRPCA and 0.543 higher than LOMDA. Although the ACC of RMDGCN is lower than NMF and SCMFDD, and the Recall is lower than LOMFDA and BRPCA, all indicators of RMDGCN are relatively balanced and high. It proves that the performance of RMDGCN is superior to other methods. The detailed results of all evaluation indicators are shown in Table 1.

Download:

Fig 5. The ROC curves and PR curves for 5CV.

(A) ROC curves (B) PR curves.

https://doi.org/10.1371/journal.pcbi.1011677.g005

Download:

Table 1. Performance comparison of different methods.

https://doi.org/10.1371/journal.pcbi.1011677.t001

Case study

In order to further verify the performance of the model, RMDGCN was used to predict the potential m¹A sites associated with breast cancer, which is a disease with a very high prevalence rate among women worldwide. We further performed gene ontology (GO) analysis on the host genes of these predicted new sites.

In case study, all known associations between the breast cancer and m¹As were assumed to be unknown. Then the prediction scores were calculated by RMDGCN between the breast cancer and m¹A modifications. According to the prediction scores, the candidate m¹A sites related to breast cancer will be sorted, and the top 500 candidate sites will be selected, and their host genes will be found. Because some sites have common host genes, we finally obtained 174 host genes for GO analysis, then, by removing redundancy, we obtained 150 host genes with unique gene symbols. GO analysis includes cell composition (CC), biological process (BP) and molecular function (MF). Then, DAVID describes these host genes from these three aspects. Among them, the p-value cutoff is set to 0.05. In order to control the error detection rate, the adjusted p-value is set to 0.1. When this condition is met, it indicates that the correlation between host gene and GO terms is statistically significant.

Ana et al. [27] showed that nucleolar protein was overexpressed in breast tumors, mainly in the nucleus, but cytoplasmic staining was observed in some cells. The CC enrichment term of genes related to breast cancer is shown in the Fig 6. The larger the circle, the richer the gene in this term. It can be seen that many genes related to breast cancer disease are enriched in the nucleus and cytoplasm, indicating that these sites may participate in the formation of nucleus and cytoplasm.

Download:

Fig 6. The CC enrichment term of genes related to breast cancer.

https://doi.org/10.1371/journal.pcbi.1011677.g006

Research shows that the occurrence of breast cancer is related to a variety of biological processes. We analyzed the relationship between the BP terms and the enriched host genes, and the results are shown in the Fig 7. It can be seen from the figure that a gene may participate in multiple biological processes, and a biological process is the result of the interaction of multiple genes. For example, CHD5 [28] gene can inhibit the proliferation of breast cancer cells in vitro and slow down the process of cell cycle. Missense mutation in LONP1 and gene mutation in TRMT61B [29] are related to m¹A modification level in a large number of tissues. The genetic variation related to the level of RNA modification is related to a variety of disease related phenotypes, including blood pressure, breast cancer and psoriasis, suggesting the role of mitochondrial RNA modification in complex diseases.

Download:

Fig 7. The relationship between the BP terms and the enriched host genes.

https://doi.org/10.1371/journal.pcbi.1011677.g007

Next, we analyzed the MF enriched by GO, and the results are shown in the figure. It can be seen from Fig 8 that GO is mainly enriched in protein binding on MF. For example, p53 binding plays an important role in the repair of radiation-induced DNA double strand breaks (DSBs) [30]. As a tumor suppressor, it has a far-reaching effect on inhibiting breast cancer, and has become an important new biomarker to judge the prognosis of breast cancer.

Download:

Fig 8. The MF enrichment term of genes related to breast cancer.

https://doi.org/10.1371/journal.pcbi.1011677.g008

Conclusion

In this study, a RMDGCN method was developed to predict the associations between the m¹A sites and various diseases, aiming to reveal the regulatory pathways of diseases regulated by the epitranscriptome layer. Using disease related mutations as links, the associations between m¹A sites and diseases were extracted, and a m¹A-disease associations network was constructed. In order to further improve the accuracy of prediction, PU learning is used to increase the number of associations.

RMDGCN is conducted on a heterogeneous network of m¹A-disease associations, which is composed of m¹A similarity network, disease similarity network, and m¹A-disease associations network. In terms of predictive performance, the AUC of RMDGCN in 5CVs is as high as 0.9892, and the AUPR is as high as 0.8682. In addition, in the case study of breast cancer, we verified the effectiveness of RMDGCN through GO analysis. In summary, RMDGCN is a powerful tool for predicting the associations between methylation modification sites and diseases.

Material and methods

The whole flowchart of RMDGCN is shown in Fig 9.

Download:

Fig 9. The flowchart of RMDGCN.

https://doi.org/10.1371/journal.pcbi.1011677.g009

Dataset

The associations between m¹As and diseases used by RMDGCN come from the RMVar database(http://rmvar.renlab.org) [31], which is specially used to collect RNA data of functional variants, and is designed to help reveal the potential function of RNA modified variants. In order to link RM-related variants with human genetic diseases, RMVar collected disease-related variants from GWAS queue and ClinVar database. In RMVar, we downloaded the dataset which is human related m¹A information with “genetype = m¹A” and “assembly = hg38”, including 62285 mutation related m¹A methylation modification sites information. In the dataset, it includes modification sites location information, corresponding SNP information, tumor disease information, etc.

Known associations between RNA Methylations and diseases

Due to the associations prediction of m¹A modifications and diseases, we only focus on the sites information of m¹A and the corresponding tumor diseases information. In the dataset we downloaded, it collected 62285 mutations related to m¹A methylation modification sites and 424 diseases. In order to make the data have a higher confidence level, we reserve the modified sites with a high confidence level and diseases with corresponding DOID in the Disease ontology database (http://www.disease-ontology.org/) [32]. Finally, we obtained 3618 m¹A modification sites, 116 diseases and 5100 associations. For convenience, we express the associations between m¹A modification sites and diseases as a binary matrix A∈R^3618×116. If there is interaction between m¹A modification site r_i and disease d_j, A (i, j) = 1; Otherwise, A (i, j) = 0.

Similarity calculation

m¹A RNA methylation similarity.

In order to calculate the similarity of m¹A modification, we first take the modification sites as the center, and lengthen the upstream and downstream by 250 bp. For each modification site, we finally obtain a sequence with the length of 501 bp based on the four AGCT bases. Word2vec [33] is a statistical method for learning word embedding. It can express a word into vector form quickly and effectively through the optimized training model according to the given corpus. It builds word embedding from the text corpus based on neural network training through two models, skip-gram and continuous Word Packet (CBOW). Skip-gram model can predict the surrounding words through the current words, while CBOW model can predict the current words through the context words. Both models focus on learning the context within the adjacent word window of a given word. We regard the length of three RNA nucleotides as an RNA word in the way of sliding window, and analyze their sequence content as an RNA corpus, and then use word2vec in the Gensim toolkit to learn the vector relationship of these RNA words, and finally generate a 100-dimensional feature vector for each sequence [34]. In the end, the sequences of all modified sites are converted into a data set r = R^n×e, where n is the number of m¹A methylation sites, and e is the dimensional of feature vector by word2vec.

Cosine similarity [35] is a measure of the difference between two individuals which uses the cosine value of the angle between two vectors in vector space. Compared with distance measurement, cosine similarity pays more attention to the difference between two vectors in the direction rather than the distance or the length. In order to measure the similarity of the two modification sites and construct the m¹A-m¹A similarity matrix, we use cosine similarity to calculate the similarity of the two m¹A modification sites. The similarity between m¹A r_i and m¹A r_j can be defined as: (2) where RRS(r_i, r_j) is the similarity between m¹A r_i and m¹A r_j, and the dimensions of similarity matrix RRS are 3618×3618, ‖•‖ denotes the L2 norm. Due to the range of cosine similarity being [–1,1], in order to keep the RRS similarity matrix within the range of [0,1], min-max normalization [36] is used to normalize the RRS similarity matrix.

Disease similarity

RMDGCN constructs a disease-disease similarity network by calculating the semantic similarity and symptom similarity of diseases. The semantic similarity of diseases is calculated by using the DOID corresponding to diseases in the DOSE package of R language, and is recorded as DOS. According to the work of Zhou et al. [37], we can measure the disease similarity and build a symptom-based disease similarity network. Here, the symptom-based disease similarity matrix DDS is obtained from the disease symptom profile. Finally, the integrated disease similarity is regarded as the disease feature. The integrated disease similarity is calculated as follows: (3) where DOS(d_i, d_j) is the semantic similarity between disease d_i and d_j, DSS(d_i, d_j) is the symptom similarity between disease d_i and d_j, DDS(d_i, d_j) is the integrated similarity between disease d_i and d_j, λ is an adjusting parameter. The dimensions of similarity matrix DDS are 116×116.

Positive-unlabeled learning

When the positive and negative samples are extremely unbalanced and the number of negative samples is far more than the positive samples, the effect of the model is very poor. In other words, the performance of the algorithm depends largely on a large number of known m¹A modification sites and diseases associations. However, currently the known m¹A modification sites and diseases associations are few, which means that the adjacency matrix A contains a large number of 0 elements. This will affect the performance of the model to a certain extent. In order to reduce the impact of sample imbalance caused by few positive samples on model performance, RMDGCN uses positive unlabeled learning to increase the number of positive samples to improve the predictive performance of the model.

We use RRS and DOS as features, randomly select the same number as the positive samples from the negative samples each time, and combine them with all positive samples to form training samples. Then the remaining negative samples are predicted to get the prediction score by XGBoost [38]. The detailed process is shown in Fig 10. We repeat the process (See Fig 10) 100 times. After repeating the process 100 times, for each negative sample, the average score of all predicted results is taken as the final score of the negative sample, while the score of the positive sample is still 1. Sample labels are confirmed based on a given threshold. If the score is greater than the threshold, then ; otherwise, . Thus, a new adjacency matrix is constructed.

Download:

Fig 10. The flowchart of PU-learning.

https://doi.org/10.1371/journal.pcbi.1011677.g010

Heterogeneous networks construction

Based on RRS, DDS and the new adjacency matrix after PU learning, the adjacency matrix representing heterogeneous networks is defined as follows: (4) where X is the m¹A-disease heterogeneous matrix, RRS is the similarity matrix between sites, DDS is the similarity matrix between diseases and diseases, is the adjacency matrix after PU learning, and is the transposition of .

In order to construct a low dimensional representation of the relationships between m¹As and diseases using GCN, we introduced penalty factors α and β on heterogeneous graph X to control the contribution of similarity in GCN propagation, and deployed GCN to combine node similarity and m¹A-disease associations information.

(5)

Graph convolution network with multilayer attention mechanism

Due to the ability of GCN to effectively extract spatial features of topology maps in non Euclidean structures, we choose GCN to extract features in the constructed heterogeneous network. There are two main types of graph convolutional neural networks, one based on spatial or vertex domains, and the other based on frequency or spectral domains. The essence of spatial graph convolution is to continuously aggregate the neighboring information of nodes, that is, directly accumulate the neighboring information of nodes to achieve graph convolution. Spectral domain based graph convolutional networks mainly rely on graph theory, correlated Fourier transform to achieve mutual conversion between spatial and spectral domains, and convolutional operation properties in spectral domains to complete the research of topological graph properties. In spectral-based graph neural networks, graphs are assumed to be undirected graphs, and one robust mathematical representation of undirected graphs is the regularized graph Laplacian matrix: (6) where A is the adjacency matrix of the graph, D is the diagonal matrix and (7)

Regularized graph Laplacian matrix has the property of being real symmetry semi-positive definite. Using this property, the regularized Laplacian matrix can be decomposed into (8) where , U is a matrix consisting of the eigenvectors of L, and Λ is a diagonal matrix, and the values on the diagonal matrix are the eigenvalues of L. The eigenvectors of the regularized Laplacian matrix form a set of orthogonal bases.

In graph signal processing, the signal x∈R^N of a graph is an eigenvector consisting of individual nodes of the graph, with x_i is thus defined as (9)

The inverse Fourier transform is: (10) where is the result of the Fourier transform.

In processing graphs, the discrete form of the Fourier transform is used. Since the Laplace matrix, after spectral decomposition, yields n linearly independent eigenvectors that form a set of orthogonal bases in space, the eigenvectors of the normalized Laplace matrix operator form the basis of the graph Fourier transform. The graph Fourier transform projects the signal of the input graph into the orthogonal space, which is equivalent to representing any vector defined on the graph as a linear combination of the eigenvectors of the Laplace matrix.

Specifically, given a network with the corresponding adjacency matrix G, the hierarchical propagation rule of the GCN is: (11) where H^(l) is the embedding of the node at layer l, D = diag(∑_j G_ij) is the degree matrix of G, W^(l) is the layer-specific trainable weight matrix, σ(•) is the nonlinear activation function, and is the normalized adjacency matrix. The left multiplication of the normalized adjacency matrix by the feature matrix is the aggregation process of neighbors, and then the right multiplication of W is the feature weighted summation process, and finally the nonlinear transformation is performed by the activation function.

In addition, considering that the contributions of different embeddings at different levels are inconsistent, we introduce attention mechanisms to combine these embeddings and ultimately obtain methylation modifications and disease embeddings: (12) where Z_R is the final m¹A methylation modification embedding, Z_D is the final disease embedding, a is the weight of each layer.

After the encoder processing, we obtain the potential feature vectors of m¹As and diseases. Then we construct the m¹A-disease network based on the potential feature vector Z_R and Z_D, and expressed the predicted probability of the associations as follow: (13) where is the reconstructed adjacency matrix, in which the element represents the predicted score of relationship between the ith m¹A and the jth disease. W^s is a trainable weight matrix. The detailed procedures of using GCN with multilayer attention mechanism to predict the associations between m¹As and diseases are shown in Fig 11.

Download:

Fig 11. The detailed procedures of using GCN to predict the associations between m¹As and diseases.

https://doi.org/10.1371/journal.pcbi.1011677.g011

Model training

RMDGCN consists of a multi-source heterogeneous network construction module and a GCN module. We obtained relevant information on the associations between m¹As and diseases, using known associations as positive samples and unknown associations as negative samples. We divide it into five parts, with four parts serving as the training set and the rest part as the testing set.

Then we use the weighted cross entropy as the loss function: (14) where ij represents the relationship pair between the ith m¹A and the jth disease, y_ij represents the true label of the relationship between the ith m¹A and the jth disease, represents the predictive score of the relationship between the ith m¹A and the jth disease, and ϑ represents the cross entropy weight, which equals to the ratio of negative samples to positive samples.

To minimize the loss function, we use the Xavier method to initialize all the trainable weight matrices, and use the Adam optimizer [39] to minimize the loss function. The Adam optimizer can update the weights of the network according to the training data. We added a discard layer to avoid overfitting to achieve regularization effect [40]. In addition, we improve the convergence effect by cycling the learning rate [41] between the maximum and minimum values.

Supporting information

S1 Data. The information for 3618 m¹A methylation sites.

https://doi.org/10.1371/journal.pcbi.1011677.s001

(XLSX)

S2 Data. The known associations between m¹A methylation sites and diseases.

https://doi.org/10.1371/journal.pcbi.1011677.s002

(XLSX)

S3 Data. The top 500 candidate m¹A methylation sites related to brease cancer by RMDGCN.

https://doi.org/10.1371/journal.pcbi.1011677.s003

(XLSX)

S1 Table. The DOID for 116 diseases.

https://doi.org/10.1371/journal.pcbi.1011677.s004

(XLSX)

S2 Table. The 150 host genes by GO analysis.

https://doi.org/10.1371/journal.pcbi.1011677.s005

(XLSX)

References

1. Dunn DB The occurence of 1-methyladenine in ribonucleic acid. Biochimica et Biophysica Acta (1961) 46: 198–200.
- View Article
- Google Scholar
2. Dominissini D, Nachtergaele S, Moshitch-Moshkovitz S, Peer E, Kol N, et al. The dynamic N-1-methyladenosine methylome in eukaryotic messenger RNA. Nature (2016) 530: 441–446. pmid:26863196
- View Article
- PubMed/NCBI
- Google Scholar
3. Safra M, Sas-Chen A, Nir R, Winkler R, Nachshon A, et al. The m(1)A landscape on cytosolic and mitochondrial mRNA at single-base resolution. Nature (2017) 551: 251–255. pmid:29072297
- View Article
- PubMed/NCBI
- Google Scholar
4. Xiaoyu Li, Xushen Xiong, Meiling, et al. Base-Resolution Mapping Reveals Distinct m1A Methylome in Nuclear- and Mitochondrial-Encoded Transcripts. Molecular cell. (2017)
- View Article
- Google Scholar
5. Peifer C, Sharma S, Watzinger P, Entian KD Yeast Rrp8p, a novel methyltransferase responsible for mA 645 base modification of 25S rRNA. Nucleic Acids Research (2012) 41: 1151–1163.
- View Article
- Google Scholar
6. Ballesta JP, Cundliffe E Site-specific methylation of 16S rRNA caused by pct, a pactamycin resistance determinant from the producing organism, Streptomyces pactum. Journal of Bacteriology (1991) 173: 7213–7218. pmid:1657884
- View Article
- PubMed/NCBI
- Google Scholar
7. Chan CTY, Dyavaiah M, Demott MS, Taghizadeh K, Dedon PC, et al. A Quantitative Systems Approach Reveals Dynamic Control of tRNA Modifications during Cellular Stress. PLoS Genetics (2010) 6: e1001247. pmid:21187895
- View Article
- PubMed/NCBI
- Google Scholar
8. Ramaswami G, Deng P, Zhang R, Anna Carbone M, Mackay TFC, et al. Genetic mapping uncovers cis-regulatory landscape of RNA editing. Nature Communications (2015) 6: 8194. pmid:26373807
- View Article
- PubMed/NCBI
- Google Scholar
9. Zhou W, Wang C, Chang J, Huang Y, Wu P RNA Methylations in Cardiovascular Diseases, Molecular Structure, Biological Functions and Regulatory Roles in Cardiovascular Diseases. Frontiers in Pharmacology (2021) 12: 722728. pmid:34489709
- View Article
- PubMed/NCBI
- Google Scholar
10. Liang Y, Zhan G, Chang KJ, Yang YP, Hsu CH The roles of m6A RNA modifiers in human cancer. Journal of the Chinese Medical Association (2020) 83: 221–226. pmid:31904662
- View Article
- PubMed/NCBI
- Google Scholar
11. Ma J, Zhang L, Chen J, Song B, Liu H mGDisAI: N7-methylguanosine (mG) sites and diseases associations inference based on heterogeneous network. BMC Bioinformatics (2021) 22: 152.
- View Article
- Google Scholar
12. Zhang L, Chen J, Ma J, Liu H HN-CNN: A Heterogeneous Network Based on Convolutional Neural Network for m7 G Site Disease Association Prediction. Frontiers in Genetics (2021) 12: 655284. pmid:33747055
- View Article
- PubMed/NCBI
- Google Scholar
13. Shin HC, Roth HR, Gao M, Lu L, Xu Z, et al. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Transactions on Medical Imaging (2016) 35: 1285–1298. pmid:26886976
- View Article
- PubMed/NCBI
- Google Scholar
14. Jiani Ma LZ, Shaojie Li, and Hui Liu BRPCA: Bounded Robust Principal Component Analysis Method to Incorporate Similarity Network for N7-methyguanosine (m7G) Sites and Diseases Association Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2022) 19: 3295–3306.
- View Article
- Google Scholar
15. Huang Y, Wu Z, Lan W, Zhong C Predicting disease-associated N7-methylguanosine(m7G) sites via random walk on heterogeneous network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023. pmid:37294648
- View Article
- PubMed/NCBI
- Google Scholar
16. Tang C, Zhou H, Zheng X, Zhang Y, Sha X Dual Laplacian regularized matrix completion for microRNA-disease associations prediction. RNA biology (2019) 16: 601–611. pmid:30676207
- View Article
- PubMed/NCBI
- Google Scholar
17. Gu C, Liao B, Li X, Cai L, Li Z, et al. Global network random walk for predicting potential human lncRNA-disease associations. Scientific Reports (2017) 7: 12442. pmid:28963512
- View Article
- PubMed/NCBI
- Google Scholar
18. Wang MN, Xie XJ, You ZH, Ding DW, Wong L A weighted non-negative matrix factorization approach to predict potential associations between drug and disease. J Transl Med (2022) 20: 552. pmid:36463215
- View Article
- PubMed/NCBI
- Google Scholar
19. Chen X, Niu YW, Wang GH, Yan GY MKRMDA: multiple kernel learning-based Kronecker regularized least squares for MiRNA–disease association prediction. J Transl Med (2017) 15. pmid:29233191
- View Article
- PubMed/NCBI
- Google Scholar
20. Zhang W, Yue X, Lin WR, Wu WJ, Liu RQ, et al. Predicting drug-disease associations by using similarity constrained matrix factorization. Bmc Bioinformatics (2018) 19: 1–14.
- View Article
- Google Scholar
21. Peng JJ, Hui WW, Li QQ, Chen BL, Hao JY, et al. A learning-based framework for miRNA-disease association identification using neural networks. Bioinformatics (2019) 35: 4364–4371. pmid:30977780
- View Article
- PubMed/NCBI
- Google Scholar
22. Point O, Related S Spectral Networks and Deep Locally Connected Networks on Graphs Proc of the 2nd International Conference on Learning Representations Banff, Canada: ICLR: (2014) 1–14.
- View Article
- Google Scholar
23. Ma ZH, Kuang ZF, Deng L CRPGCN: predicting circRNA-disease associations using graph convolutional network based on heterogeneous network. Bmc Bioinformatics (2021) 22: 1–23.
- View Article
- Google Scholar
24. Hou JL, Wei H, Liu B iPiDA-GCN: Identification of piRNA-disease associations based on Graph Convolutional Network. Plos Computational Biology (2022) 18: e1010671. pmid:36301998
- View Article
- PubMed/NCBI
- Google Scholar
25. Yi Zhang GC, Zhenmei Wang Long non-coding RNA-disease association prediction model based on semantic and global dual attention mechanism. Journal of Computer Applications: (2023) 1–10.
- View Article
- Google Scholar
26. Pech R LOMDA: Linear optimization for miRNA-disease association prediction: bioRxiv preprint. (2019)
- View Article
- Google Scholar
27. Gregório A, Lacerda M, Figueiredo P, Simes S, Dias S, et al. Meeting the needs of breast cancer: A nucleolin’s perspective. Critical Reviews in Oncology/hematology (2018) 125: 89–101. pmid:29650282
- View Article
- PubMed/NCBI
- Google Scholar
28. Bagchi A, Papazoglu C, Ying W, Capurso D, Brodt M, et al. Cell- Comments on CHD5 Is a Tumor Suppressor at Human 1p36. Current Biology: 459–475.
- View Article
- Google Scholar
29. Ali AT, Idaghdour Y, Hodgkinson A Analysis of mitochondrial m1A/G RNA modification reveals links to nuclear genetic variants and associated disease processes. Communications Biology 3: 147. pmid:32221480
- View Article
- PubMed/NCBI
- Google Scholar
30. Huo Q, Yang Q P53-binding protein 1: a new player for tumorigenesis and a new target for breast cancer treatment. Medical Hypotheses (2011) 77: 359–363. pmid:21641119
- View Article
- PubMed/NCBI
- Google Scholar
31. Luo X, Li H, Liang J, Zhao Q, Xie Y, et al. RMVar: an updated database of functional variants involved in RNA modifications. Nucleic Acids Research: (2020) D1.
- View Article
- Google Scholar
32. Kibbe WA, Arze C, Felix V, Mitraka E, Schriml LM Disease Ontology 2015 update: An expanded and updated database of Human diseases for linking biomedical knowledge through disease data. Nucleic Acids Research (2014) 43: D1. pmid:25348409
- View Article
- PubMed/NCBI
- Google Scholar
33. Le QV, Mikolov T Distributed Representations of Sentences and Documents. arXiv e-prints. (2014)
- View Article
- Google Scholar
34. Zou Q Sr., Xing P, Wei L, Liu B Gene2vec: Gene Subsequence Embedding for Prediction of Mammalian N6-Methyladenosine Sites from mRNA. RNA (2019) 25: 205–218. pmid:30425123
- View Article
- PubMed/NCBI
- Google Scholar
35. Lahitani AR, Permanasari AE, Setiawan NA. Cosine similarity to determine similarity measure: Study case in online essay assessment; 2016.
- View Article
- Google Scholar
36. Xiaojiang J, Sanbao D, Guandong W Using Min-Max Normalization to Measure the Differences of Regional Economic Growth——A Case Study of Yulin Area,Shanxi Province. Economy and Management. (2016)
- View Article
- Google Scholar
37. Zhou XZ, Menche J, Barabasi AL, Sharma A Human symptoms-disease network. Nature Communications (2014) 5: 4212. pmid:24967666
- View Article
- PubMed/NCBI
- Google Scholar
38. Chen T, Guestrin C XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: (2016) 785–794.
- View Article
- Google Scholar
39. Kingma D, Ba J Adam: A Method for Stochastic Optimization. Computer Science: (2014) 2014.arXiv.1412.6980.
- View Article
- Google Scholar
40. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (2014) 15: 1929–1958.
- View Article
- Google Scholar
41. Smith L. Cyclical Learning Rates for Training Neural Networks. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV): (2017) 464–472.
- View Article
- Google Scholar

[ref1] 1. Dunn DB The occurence of 1-methyladenine in ribonucleic acid. Biochimica et Biophysica Acta (1961) 46: 198–200.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Dominissini D, Nachtergaele S, Moshitch-Moshkovitz S, Peer E, Kol N, et al. The dynamic N-1-methyladenosine methylome in eukaryotic messenger RNA. Nature (2016) 530: 441–446. pmid:26863196
View Article
PubMed/NCBI
Google Scholar

[5] View Article

[6] PubMed/NCBI

[7] Google Scholar

[ref3] 3. Safra M, Sas-Chen A, Nir R, Winkler R, Nachshon A, et al. The m(1)A landscape on cytosolic and mitochondrial mRNA at single-base resolution. Nature (2017) 551: 251–255. pmid:29072297
View Article
PubMed/NCBI
Google Scholar

[9] View Article

[10] PubMed/NCBI

[11] Google Scholar

[ref4] 4. Xiaoyu Li, Xushen Xiong, Meiling, et al. Base-Resolution Mapping Reveals Distinct m1A Methylome in Nuclear- and Mitochondrial-Encoded Transcripts. Molecular cell. (2017)
View Article
Google Scholar

[13] View Article

[14] Google Scholar

[ref5] 5. Peifer C, Sharma S, Watzinger P, Entian KD Yeast Rrp8p, a novel methyltransferase responsible for mA 645 base modification of 25S rRNA. Nucleic Acids Research (2012) 41: 1151–1163.
View Article
Google Scholar

[16] View Article

[17] Google Scholar

[ref6] 6. Ballesta JP, Cundliffe E Site-specific methylation of 16S rRNA caused by pct, a pactamycin resistance determinant from the producing organism, Streptomyces pactum. Journal of Bacteriology (1991) 173: 7213–7218. pmid:1657884
View Article
PubMed/NCBI
Google Scholar

[19] View Article

[20] PubMed/NCBI

[21] Google Scholar

[ref7] 7. Chan CTY, Dyavaiah M, Demott MS, Taghizadeh K, Dedon PC, et al. A Quantitative Systems Approach Reveals Dynamic Control of tRNA Modifications during Cellular Stress. PLoS Genetics (2010) 6: e1001247. pmid:21187895
View Article
PubMed/NCBI
Google Scholar

[23] View Article

[24] PubMed/NCBI

[25] Google Scholar

[ref8] 8. Ramaswami G, Deng P, Zhang R, Anna Carbone M, Mackay TFC, et al. Genetic mapping uncovers cis-regulatory landscape of RNA editing. Nature Communications (2015) 6: 8194. pmid:26373807
View Article
PubMed/NCBI
Google Scholar

[27] View Article

[28] PubMed/NCBI

[29] Google Scholar

[ref9] 9. Zhou W, Wang C, Chang J, Huang Y, Wu P RNA Methylations in Cardiovascular Diseases, Molecular Structure, Biological Functions and Regulatory Roles in Cardiovascular Diseases. Frontiers in Pharmacology (2021) 12: 722728. pmid:34489709
View Article
PubMed/NCBI
Google Scholar

[31] View Article

[32] PubMed/NCBI

[33] Google Scholar

[ref10] 10. Liang Y, Zhan G, Chang KJ, Yang YP, Hsu CH The roles of m6A RNA modifiers in human cancer. Journal of the Chinese Medical Association (2020) 83: 221–226. pmid:31904662
View Article
PubMed/NCBI
Google Scholar

[35] View Article

[36] PubMed/NCBI

[37] Google Scholar

[ref11] 11. Ma J, Zhang L, Chen J, Song B, Liu H mGDisAI: N7-methylguanosine (mG) sites and diseases associations inference based on heterogeneous network. BMC Bioinformatics (2021) 22: 152.
View Article
Google Scholar

[39] View Article

[40] Google Scholar

[ref12] 12. Zhang L, Chen J, Ma J, Liu H HN-CNN: A Heterogeneous Network Based on Convolutional Neural Network for m7 G Site Disease Association Prediction. Frontiers in Genetics (2021) 12: 655284. pmid:33747055
View Article
PubMed/NCBI
Google Scholar

[42] View Article

[43] PubMed/NCBI

[44] Google Scholar

[ref13] 13. Shin HC, Roth HR, Gao M, Lu L, Xu Z, et al. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Transactions on Medical Imaging (2016) 35: 1285–1298. pmid:26886976
View Article
PubMed/NCBI
Google Scholar

[46] View Article

[47] PubMed/NCBI

[48] Google Scholar

[ref14] 14. Jiani Ma LZ, Shaojie Li, and Hui Liu BRPCA: Bounded Robust Principal Component Analysis Method to Incorporate Similarity Network for N7-methyguanosine (m7G) Sites and Diseases Association Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2022) 19: 3295–3306.
View Article
Google Scholar

[50] View Article

[51] Google Scholar

[ref15] 15. Huang Y, Wu Z, Lan W, Zhong C Predicting disease-associated N7-methylguanosine(m7G) sites via random walk on heterogeneous network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023. pmid:37294648
View Article
PubMed/NCBI
Google Scholar

[53] View Article

[54] PubMed/NCBI

[55] Google Scholar

[ref16] 16. Tang C, Zhou H, Zheng X, Zhang Y, Sha X Dual Laplacian regularized matrix completion for microRNA-disease associations prediction. RNA biology (2019) 16: 601–611. pmid:30676207
View Article
PubMed/NCBI
Google Scholar

[57] View Article

[58] PubMed/NCBI

[59] Google Scholar

[ref17] 17. Gu C, Liao B, Li X, Cai L, Li Z, et al. Global network random walk for predicting potential human lncRNA-disease associations. Scientific Reports (2017) 7: 12442. pmid:28963512
View Article
PubMed/NCBI
Google Scholar

[61] View Article

[62] PubMed/NCBI

[63] Google Scholar

[ref18] 18. Wang MN, Xie XJ, You ZH, Ding DW, Wong L A weighted non-negative matrix factorization approach to predict potential associations between drug and disease. J Transl Med (2022) 20: 552. pmid:36463215
View Article
PubMed/NCBI
Google Scholar

[65] View Article

[66] PubMed/NCBI

[67] Google Scholar

[ref19] 19. Chen X, Niu YW, Wang GH, Yan GY MKRMDA: multiple kernel learning-based Kronecker regularized least squares for MiRNA–disease association prediction. J Transl Med (2017) 15. pmid:29233191
View Article
PubMed/NCBI
Google Scholar

[69] View Article

[70] PubMed/NCBI

[71] Google Scholar

[ref20] 20. Zhang W, Yue X, Lin WR, Wu WJ, Liu RQ, et al. Predicting drug-disease associations by using similarity constrained matrix factorization. Bmc Bioinformatics (2018) 19: 1–14.
View Article
Google Scholar

[73] View Article

[74] Google Scholar

[ref21] 21. Peng JJ, Hui WW, Li QQ, Chen BL, Hao JY, et al. A learning-based framework for miRNA-disease association identification using neural networks. Bioinformatics (2019) 35: 4364–4371. pmid:30977780
View Article
PubMed/NCBI
Google Scholar

[76] View Article

[77] PubMed/NCBI

[78] Google Scholar

[ref22] 22. Point O, Related S Spectral Networks and Deep Locally Connected Networks on Graphs Proc of the 2nd International Conference on Learning Representations Banff, Canada: ICLR: (2014) 1–14.
View Article
Google Scholar

[80] View Article

[81] Google Scholar

[ref23] 23. Ma ZH, Kuang ZF, Deng L CRPGCN: predicting circRNA-disease associations using graph convolutional network based on heterogeneous network. Bmc Bioinformatics (2021) 22: 1–23.
View Article
Google Scholar

[83] View Article

[84] Google Scholar

[ref24] 24. Hou JL, Wei H, Liu B iPiDA-GCN: Identification of piRNA-disease associations based on Graph Convolutional Network. Plos Computational Biology (2022) 18: e1010671. pmid:36301998
View Article
PubMed/NCBI
Google Scholar

[86] View Article

[87] PubMed/NCBI

[88] Google Scholar

[ref25] 25. Yi Zhang GC, Zhenmei Wang Long non-coding RNA-disease association prediction model based on semantic and global dual attention mechanism. Journal of Computer Applications: (2023) 1–10.
View Article
Google Scholar

[90] View Article

[91] Google Scholar

[ref26] 26. Pech R LOMDA: Linear optimization for miRNA-disease association prediction: bioRxiv preprint. (2019)
View Article
Google Scholar

[93] View Article

[94] Google Scholar

[ref27] 27. Gregório A, Lacerda M, Figueiredo P, Simes S, Dias S, et al. Meeting the needs of breast cancer: A nucleolin’s perspective. Critical Reviews in Oncology/hematology (2018) 125: 89–101. pmid:29650282
View Article
PubMed/NCBI
Google Scholar

[96] View Article

[97] PubMed/NCBI

[98] Google Scholar

[ref28] 28. Bagchi A, Papazoglu C, Ying W, Capurso D, Brodt M, et al. Cell- Comments on CHD5 Is a Tumor Suppressor at Human 1p36. Current Biology: 459–475.
View Article
Google Scholar

[100] View Article

[101] Google Scholar

[ref29] 29. Ali AT, Idaghdour Y, Hodgkinson A Analysis of mitochondrial m1A/G RNA modification reveals links to nuclear genetic variants and associated disease processes. Communications Biology 3: 147. pmid:32221480
View Article
PubMed/NCBI
Google Scholar

[103] View Article

[104] PubMed/NCBI

[105] Google Scholar

[ref30] 30. Huo Q, Yang Q P53-binding protein 1: a new player for tumorigenesis and a new target for breast cancer treatment. Medical Hypotheses (2011) 77: 359–363. pmid:21641119
View Article
PubMed/NCBI
Google Scholar

[107] View Article

[108] PubMed/NCBI

[109] Google Scholar

[ref31] 31. Luo X, Li H, Liang J, Zhao Q, Xie Y, et al. RMVar: an updated database of functional variants involved in RNA modifications. Nucleic Acids Research: (2020) D1.
View Article
Google Scholar

[111] View Article

[112] Google Scholar

[ref32] 32. Kibbe WA, Arze C, Felix V, Mitraka E, Schriml LM Disease Ontology 2015 update: An expanded and updated database of Human diseases for linking biomedical knowledge through disease data. Nucleic Acids Research (2014) 43: D1. pmid:25348409
View Article
PubMed/NCBI
Google Scholar

[114] View Article

[115] PubMed/NCBI

[116] Google Scholar

[ref33] 33. Le QV, Mikolov T Distributed Representations of Sentences and Documents. arXiv e-prints. (2014)
View Article
Google Scholar

[118] View Article

[119] Google Scholar

[ref34] 34. Zou Q Sr., Xing P, Wei L, Liu B Gene2vec: Gene Subsequence Embedding for Prediction of Mammalian N6-Methyladenosine Sites from mRNA. RNA (2019) 25: 205–218. pmid:30425123
View Article
PubMed/NCBI
Google Scholar

[121] View Article

[122] PubMed/NCBI

[123] Google Scholar

[ref35] 35. Lahitani AR, Permanasari AE, Setiawan NA. Cosine similarity to determine similarity measure: Study case in online essay assessment; 2016.
View Article
Google Scholar

[125] View Article

[126] Google Scholar

[ref36] 36. Xiaojiang J, Sanbao D, Guandong W Using Min-Max Normalization to Measure the Differences of Regional Economic Growth——A Case Study of Yulin Area,Shanxi Province. Economy and Management. (2016)
View Article
Google Scholar

[128] View Article

[129] Google Scholar

[ref37] 37. Zhou XZ, Menche J, Barabasi AL, Sharma A Human symptoms-disease network. Nature Communications (2014) 5: 4212. pmid:24967666
View Article
PubMed/NCBI
Google Scholar

[131] View Article

[132] PubMed/NCBI

[133] Google Scholar

[ref38] 38. Chen T, Guestrin C XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: (2016) 785–794.
View Article
Google Scholar

[135] View Article

[136] Google Scholar

[ref39] 39. Kingma D, Ba J Adam: A Method for Stochastic Optimization. Computer Science: (2014) 2014.arXiv.1412.6980.
View Article
Google Scholar

[138] View Article

[139] Google Scholar

[ref40] 40. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (2014) 15: 1929–1958.
View Article
Google Scholar

[141] View Article

[142] Google Scholar

[ref41] 41. Smith L. Cyclical Learning Rates for Training Neural Networks. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV): (2017) 464–472.
View Article
Google Scholar

[144] View Article

[145] Google Scholar

Figures

Abstract

Author summary

Introduction

Results and discussion

Evaluation metrics

Adjustment of parameters

Comparing with other methods

Case study

Conclusion

Material and methods

Dataset

Known associations between RNA Methylations and diseases

Similarity calculation

m1A RNA methylation similarity.

Disease similarity

Positive-unlabeled learning

Heterogeneous networks construction

Graph convolution network with multilayer attention mechanism

Model training

Supporting information

S1 Data. The information for 3618 m1A methylation sites.

S2 Data. The known associations between m1A methylation sites and diseases.

S3 Data. The top 500 candidate m1A methylation sites related to brease cancer by RMDGCN.

S1 Table. The DOID for 116 diseases.

S2 Table. The 150 host genes by GO analysis.

References

m¹A RNA methylation similarity.

S1 Data. The information for 3618 m¹A methylation sites.

S2 Data. The known associations between m¹A methylation sites and diseases.

S3 Data. The top 500 candidate m¹A methylation sites related to brease cancer by RMDGCN.