Introduction

The Technical Advisory Group on SARS-CoV-2 Virus Evolution (TAG-VE), an independent panel of scientists, regularly monitors and evaluates the evolution of the SARS-CoV-2 virus to determine whether specific mutations or combinations of mutations alter the behavior of the virus. The TAG-VE evaluated the B.1.1.529 variant of SARS-CoV-2 on November 26, 2021. South Africa first reported the B.1.1.529 variant to the World Health Organization (WHO) on November 24, 2021 [26]. The WHO classified B.1.1.529 as a variant of concern (VOC) under the name Omicron, based on evidence of a detrimental change in COVID-19 epidemiology. Like other SARS-CoV-2 variants, the Omicron variant comprises numerous lineages and sublineages. Omicron presently has 3 main lineages: BA.5, BA.4, and BA.2. The Omicron Pango lineage currently has six sublineages (BA.5, BA.4, BA.3, BA.2, BA.1.1, BA.1). Although these lineages are often very similar to one another, differences between them may influence how the virus behaves. In our research, we created seven protein–protein interaction (PPI) networks of the Omicron Pango lineage, covering all the variants. We built the networks with STRING, analyzed them, and identified the most influential proteins in each network. A network that describes the interactions between the parts of a complex system is easier to analyze than each component investigated separately. It is well established in the analysis of biological and social networks that most networks contain a few significant or influential elements, such as crucial proteins in PPI networks. These elements, or vertices, have distinctive structural characteristics, which are quantified using various centrality metrics. Centrality measurements rank the vertices and edges of a graph from several perspectives.
To pinpoint “central” nodes in extensive networks, numerous centrality measures (CM) have been developed. Because several options are available for ranking influential nodes, the user can choose the metric that best fits the study of a given network. The network architecture also affects how centrality metrics rank influential nodes, which further complicates the selection of an appropriate measure. To find the centrality metric that is most successful at predicting influential proteins, we examined the centrality profiles of the nodes of the Omicron PPI networks (PPINs). We also examined how a broad range of widely used centrality measures reflects various topological network properties. This study reviews the state of the art in centrality estimation for biological networks. To identify the most significant protein in the network, this research applies four centrality metrics [PageRank centrality (PRC), degree centrality (DC), closeness centrality (CC), and betweenness centrality (BC)], supplemented by additional scores [node degree, local clustering coefficient (CCo), and p value], to the PPI networks of the Omicron variant.

Related Work

Biological networks and social networks are graph structures that can be used to describe a variety of complex systems, including biological and social systems [11]. Selecting an appropriate set of centrality measures is essential for determining the significant functional characteristics of a network [7, 8]. A critical analysis of centrality measures in social networks is given in [18], where certain centrality measures (BC, CC, DC, and eigenvector centrality) were analyzed against three straightforward conditions on the behavior of a centrality measure. The authors of [9] analyzed the PPI network of Parkinson’s disease using a skyline query. Parkinson’s disease, one of the fastest-growing disorders worldwide, was shown to have 12 important proteins. The features of the PPI network were represented by attributes based on centrality measures. In [1], the authors discovered target genes for cancer using protein–protein interaction networks. Hubs and centrality measures were used to examine the candidate genes, and the genes with the highest scores in both mutation rate and graph centrality were extracted as target genes. The authors of [2] compared 27 popular centrality measures using yeast PPINs. The measures classify and rank the influential nodes of the networks. They also used hierarchical clustering and principal component analysis (PCA), and they found that the topology of the network affects which metrics are the most useful. The survey paper [6] covers both historical and contemporary research on social network centrality measures. It discusses the centrality measures that have been proposed and their mathematical definitions. In addition, it describes uses of centrality measures in education research [12], biology [11], traffic [14], transportation [25], and security [5, 21].
Centrality measures have many applications in networks from different fields [4], such as psychological networks [3, 16], brain networks [15], and differential privacy models [17].

Methods

This study used data on the Omicron lineage variants. The research was carried out in several steps: data collection, data cleaning, data validation, creation of the PPIN data, centrality measurement, and finally clustering of the whole network. The clustering is performed with MCL (the Markov clustering algorithm) [20]. The objective of this work is to identify and prioritize the significant proteins. To this end, we focus on the centrality measures of the network. Figure 1 illustrates the research workflow.

Fig. 1 Flow chart for numerical solution procedure

Data Collection

We took the Omicron dataset from the Universal Protein Resource/Swiss-Prot (UniProt/Swiss-Prot) database [23], using reviewed entries for proteins found in the human body. Swiss-Prot is a highly annotated, non-redundant protein sequence database that stores experimental results, computed features, and scientific conclusions. UniProtKB/Swiss-Prot is the reviewed section of the UniProt Knowledgebase. It provides accurate, consistent, and rich annotations of functional information about proteins. Initially, we took a total of 228 proteins: B.1.1.529 (27), BA.5 (30), BA.4 (31), BA.3 (34), BA.2 (38), BA.1.1 (34), BA.1 (34), and analyzed the PPIN of each individual Omicron lineage. The PPINs of the Omicron lineage variants are shown in Figs. 2, 3, 4, 5, 6 and 7. We then pooled the data and removed duplicate entries to create the Omicron PPIN. In all cases, data validation and PPIN data creation were done with STRING [22]. The STRING database draws on several sources of information, including computational prediction methods, experimental data, and public text collections. It is updated regularly and is free to access. In addition, it generates network images using a spring model, in which nodes are treated as masses and edges as springs. After cleaning the data, we obtained 68 unique proteins, which form the Omicron PPIN.

Fig. 2 PPIN of B.1.1.529

Fig. 3 PPIN of BA.1

Fig. 4 PPIN of BA.2

Fig. 5 PPIN of BA.3

Fig. 6 PPIN of BA.4

Fig. 7 PPIN of BA.5

Centrality Measure

Here, we discuss an important aspect of network analysis called centrality. Centrality is a widely used measure of how central a particular node is within a network. The network built from the PPI data is treated as an undirected graph, and each centrality method assigns a weight to every node in the graph. BC, CC, DC, and PRC are among the centrality measures that can be applied to undirected graphs. Figures 2, 3, 4, 5, 6, 7 and 14 depict protein networks as examples of undirected graphs. The variants BA.1 and BA.1.1 have the same PPIN and differ only in their mutations. The edges of the graph represent the functional interactions or relationships between proteins, whereas the nodes represent the proteins that affect Omicron’s activity.

Degree Centrality

The first basic centrality measure is the degree centrality (DC) [10]. The degree of a node is the number of edges incident to it. The DC is the degree of a node in normalized form.

The DC of a node \(v\) is the degree of \(v\) divided by the maximum degree of any node in the graph. A node’s degree centrality \(C_\textrm{d} (v)\) in a network G(V, E) is defined mathematically as follows:

$$\begin{aligned} C_\textrm{d} ( v) = \frac{\deg ( v)}{\max _{u \in V } \deg ( u)}. \end{aligned}$$
(1)

It ranges between 0 and 1; the higher the degree centrality, the closer the node’s degree is to the maximum degree in the network. The \(C_\textrm{d} ( v)\) can be used to identify the more prominent or influential nodes in a network.
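As an illustration of Eq. (1), the following minimal Python sketch computes the degree centrality of every node in a small undirected graph. The adjacency dictionary `toy_graph` and its node names are hypothetical placeholders, not proteins from the Omicron PPIN, and the function name is our own.

```python
def degree_centrality(adjacency):
    """Degree of each node divided by the maximum degree in the graph (Eq. 1)."""
    degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
    max_degree = max(degrees.values(), default=0)
    if max_degree == 0:
        # A graph without edges: every node gets centrality 0 by convention.
        return {node: 0.0 for node in degrees}
    return {node: degree / max_degree for node, degree in degrees.items()}


# Hypothetical toy graph (undirected, stored as symmetric neighbor sets).
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

dc = degree_centrality(toy_graph)
for node in sorted(dc):
    print(node, round(dc[node], 3))
```

In this toy graph, node A has the maximum degree (3) and therefore a degree centrality of 1, while node D, with a single neighbor, scores 1/3.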

Closeness Centrality

The closeness centrality (CC) [19] indicates how close a node is to the rest of the network. CC is one way to identify nodes that can spread information efficiently throughout a graph. The closeness centrality of a node is based on the average distance between that node and all other nodes. Nodes with a high closeness score have the shortest distances to all other nodes. A node’s closeness centrality \(C_\textrm{c}( v )\) in a graph G(V, E) is defined mathematically as follows:

$$\begin{aligned} C_\textrm{c} ( v) = \frac{|V |- 1}{\sum _{u\in V - { \{ v\} }}^{}d ( u, v)}. \end{aligned}$$
(2)

where \(|V |\) is the number of nodes and \(d(u, v)\) is the distance between two nodes \(u\) and \(v\). The higher the value of CC, the more central the node is in the network. The measure is useful for studying or limiting the spread of disease in epidemic modeling.
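Eq. (2) can be sketched in Python with a breadth-first search that yields the shortest-path distances from one node. This is only a sketch under the assumption that the graph is unweighted and connected; the text does not state how distances are handled when a network has several components, so the function below simply rejects that case. The graph `path_graph` is a hypothetical example.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest-path distances (in edges) from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def closeness_centrality(adjacency, node):
    """(|V| - 1) divided by the sum of distances from node to all others (Eq. 2)."""
    distances = bfs_distances(adjacency, node)
    if len(distances) != len(adjacency):
        # Assumption of this sketch: Eq. (2) is only evaluated on connected graphs.
        raise ValueError("graph must be connected")
    total = sum(distances.values())
    return (len(adjacency) - 1) / total if total else 0.0


# Hypothetical path graph A - B - C - D.
path_graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
cc = {node: closeness_centrality(path_graph, node) for node in path_graph}
print(cc)
```

The inner nodes B and C obtain 3/4 = 0.75, and the end nodes A and D obtain 3/6 = 0.5, which matches the intuition that inner nodes are closer to the rest of the path.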

Betweenness Centrality

Betweenness centrality (BC) [10] measures how often a node lies on the paths between other nodes, that is, how many shortest paths of the network pass through the node. A node’s betweenness centrality \(C_\textrm{b}( v )\) in a network G(V, E) is defined mathematically as follows:

$$\begin{aligned} C_\textrm{b} ( v) =\sum _{x, y \in V - \{v\} } \frac{\sigma _{xy} ( v )}{\sigma _{xy}} \end{aligned}$$
(3)

where \(\sigma _{xy}\) denotes the number of shortest paths in the network between nodes x and y, and \(\sigma _{xy} ( v )\) denotes the number of those paths that pass through \(v\). If \(x = y\), then \(\sigma _{xy} = 1\). The BC is useful for identifying super spreaders when analyzing disease spread in epidemiology.
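One common way to evaluate Eq. (3) on an unweighted graph is Brandes' algorithm, which counts shortest paths with one breadth-first search per source node. The sketch below is an illustration rather than the procedure used in this study. It assumes that each unordered pair of nodes is counted once (hence the final division by 2) and that no normalization is applied; `path_graph` is a hypothetical example.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness (Eq. 3) of an undirected, unweighted graph."""
    betweenness = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that also counts shortest paths (sigma).
        order = []
        predecessors = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        sigma[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                betweenness[w] += delta[w]
    # Each unordered pair (x, y) was visited from both ends, so halve the totals.
    return {v: value / 2.0 for v, value in betweenness.items()}


# Hypothetical path graph A - B - C - D.
path_graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
bc = betweenness_centrality(path_graph)
print(bc)
```

In the path graph, B lies on the shortest paths A to C and A to D, so its betweenness is 2; the end nodes lie on no shortest path between other nodes and score 0.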

Page Rank

PageRank centrality (PRC) [13] is an adaptation of eigenvector centrality that was designed to rank web content using the value of the links between sites. It can be applied to any type of network, including protein interaction networks. Mathematically, the PageRank centrality \(C_\textrm{PR}( v_i )\) of a node \(v_i\) in a network G(V, E) is defined as

$$\begin{aligned} C_\textrm{PR} ( v_i) = \frac{1-d}{\vert V \vert } + d \sum _{ ( v_t)\in \textrm{Inneighbor} ( v_i)}^{} \frac{C_\textrm{PR}( v_t )}{\textrm{outdeg}( v_t )} \end{aligned}$$
(4)

where d is a constant called the damping factor, usually set to 0.85.
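Eq. (4) defines the PageRank scores implicitly, so in practice they are usually obtained by iterating the equation until the values stop changing. The following Python sketch shows this power iteration under two assumptions that are ours, not the paper's: on an undirected graph the in-neighbors of a node are its neighbors and the out-degree is the degree, and the score of an isolated node is spread uniformly over all nodes. The graph `star_graph` is a hypothetical example.

```python
def pagerank(adjacency, damping=0.85, tol=1e-12, max_iter=500):
    """Power iteration for Eq. (4) on an undirected graph given as neighbor sets."""
    n = len(adjacency)
    rank = {v: 1.0 / n for v in adjacency}
    for _ in range(max_iter):
        # Assumption: nodes without neighbors share their score with every node.
        dangling = sum(rank[v] for v in adjacency if not adjacency[v])
        new_rank = {}
        for v in adjacency:
            incoming = sum(rank[u] / len(adjacency[u]) for u in adjacency[v])
            new_rank[v] = (1.0 - damping) / n + damping * (incoming + dangling / n)
        change = sum(abs(new_rank[v] - rank[v]) for v in adjacency)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical star graph: hub A connected to leaves B, C and D.
star_graph = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
pr = pagerank(star_graph)
for node in sorted(pr):
    print(node, round(pr[node], 4))
```

The scores sum to 1, and the hub A receives the largest share because every leaf passes its whole score to it.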

Markov Clustering

Stijn van Dongen developed the Markov Cluster (MCL) algorithm [24] at the Centre for Mathematics and Computer Science in the Netherlands. It is a fast and scalable unsupervised clustering method for networks, based on the simulation of flow in graphs, and it is used in bioinformatics and other fields. In our study, the distance matrix derived from the STRING global scores serves as the input to MCL. Interacting proteins with higher global scores are more likely to cluster together. The MCL [7, 8] alternates two operations. Expansion corresponds to standard matrix multiplication and simulates how flow spreads out and becomes more homogeneous. Inflation is formally a Hadamard power followed by a diagonal scaling. Inflation contracts the flow, thickening it where the current is strong and thinning it where the current is weak. The number of clusters cannot be specified in advance; it is controlled implicitly by the inflation parameter. Higher inflation yields more clusters and thus indirectly determines the granularity of the clustering. Here, the inflation value has been set at 2.
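The alternation of expansion and inflation can be sketched in a few lines of Python. This is a simplified illustration, not van Dongen's implementation and not the STRING pipeline: it assumes a symmetric similarity matrix as input, adds self-loops before normalizing (a common convention), runs a fixed number of iterations instead of testing for convergence, and reads clusters off the rows that still carry flow. All function names and the matrix `two_triangles` are hypothetical.

```python
def normalize_columns(matrix):
    """Diagonal scaling: rescale every column so that it sums to 1."""
    n = len(matrix)
    result = [row[:] for row in matrix]
    for j in range(n):
        column_sum = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            result[i][j] = matrix[i][j] / column_sum if column_sum else 0.0
    return result


def expand(matrix):
    """Expansion: ordinary matrix product of the matrix with itself."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def inflate(matrix, inflation):
    """Inflation: entry-wise (Hadamard) power followed by column rescaling."""
    powered = [[value ** inflation for value in row] for row in matrix]
    return normalize_columns(powered)


def markov_cluster(similarity, inflation=2.0, iterations=50):
    n = len(similarity)
    # Assumption: self-loops are added so that every node keeps some flow.
    matrix = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    matrix = normalize_columns(matrix)
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    # Rows with remaining flow act as attractors; their nonzero columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if matrix[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical input: two separate triangles (nodes 0-2 and nodes 3-5).
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(two_triangles))  # [[0, 1, 2], [3, 4, 5]]
```

The default `inflation=2.0` mirrors the value used in this study; raising it in the sketch tends to split the graph into more, smaller clusters.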

Results and Discussion

The global properties of the Omicron base lineage variants are shown in Table 1. All seven networks except B.1.1.529 have an average node degree greater than 1. Three base lineages (B.1.1.529, BA.1, BA.1.1) have the same density, 0.0284. The highest density is 0.06719 (BA.4) and the lowest is 0.00416 (BA.3). The average LCC is fairly high (at most 0.771). The best network is BA.2, which has the smallest p value (0.00038). Table 2 shows the global features of the Omicron PPIN: the average node degree is 1.53 and the density is 0.0228. Table 3 lists the centrality scores of the 68 proteins, which allow us to assess the relevance of each protein. The network has a maximum degree of 7 and an average local clustering coefficient (LCC) of 0.385. LCC values range from 0 to 1 and represent the density of connections among a node’s neighbors. Nodes with higher values belong to densely connected clusters, and a node with a value of 1 is considered part of a clique. The proteins GRB7, KCNK17, NDUFB5, NDUFV1, RPSA, SNRPB, SNRPD1, and SNRPE in Table 3 have a CCo value of 1, as they are part of a clique. Figure 14 shows the PPI network of Omicron, and the scores of the CMs are visualized in Figs. 8, 9, 10, 11, 12 and 13. For each centrality measure, we took half of its maximum value as the threshold for that category. The threshold helps us identify the important proteins in the network. We highlighted the significant proteins by taking the intersection of the important proteins of each category (CC, DC, and PRC). A total of 8 significant proteins were detected among the 68 unique proteins. Using the Markov clustering algorithm, we extracted 18 network clusters from the main Omicron network, as shown in Figs. 14 and 15.
In Table 4, clusters \({\mathcal {C}}_1, {\mathcal {C}}_2\) and \({\mathcal {C}}_3\) each contain 4 proteins, \({\mathcal {C}}_4\) to \({\mathcal {C}}_8\) each contain 4 proteins, and each of the remaining clusters contains 2 proteins.
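The global scores discussed here can be illustrated with a short Python sketch. It assumes the standard graph-theoretic definitions (average degree 2|E|/|V|, density 2|E|/(|V|(|V| - 1)), and a local clustering coefficient equal to the fraction of linked neighbor pairs, set to 0 for nodes with fewer than two neighbors); the exact definitions behind the reported values are not spelled out in the text. Under these definitions the density equals the average degree divided by |V| - 1, which agrees with the values reported for the Omicron PPIN (1.53 / 67 ≈ 0.0228). The graph `toy_graph` is a hypothetical example.

```python
from itertools import combinations


def average_degree(adjacency):
    """Mean number of neighbors per node, 2|E| / |V|."""
    return sum(len(neighbors) for neighbors in adjacency.values()) / len(adjacency)


def density(adjacency):
    """Fraction of possible edges that are present, 2|E| / (|V| (|V| - 1))."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    edges = sum(len(neighbors) for neighbors in adjacency.values()) / 2
    return 2 * edges / (n * (n - 1))


def local_clustering(adjacency, node):
    """Fraction of pairs of neighbors of node that are themselves connected."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        # Assumed convention: nodes with fewer than two neighbors score 0.
        return 0.0
    links = sum(1 for u, w in combinations(neighbors, 2) if w in adjacency[u])
    return 2 * links / (k * (k - 1))


# Hypothetical toy graph: triangle A-B-C with a pendant node D attached to A.
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(average_degree(toy_graph), density(toy_graph))
print({node: local_clustering(toy_graph, node) for node in sorted(toy_graph)})
```

Nodes B and C obtain a coefficient of 1 because their two neighbors are linked to each other, which is the situation described above for proteins that are part of a clique.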
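The selection rule described above (half of the maximum of each centrality measure as the threshold, followed by an intersection across measures) can be sketched as follows. The scores and the node names P1 to P3 are hypothetical placeholders and are not taken from Table 3. Whether a score exactly equal to the threshold counts as important is not stated in the text, so the use of `>=` is an assumption.

```python
def significant_nodes(scores_by_measure):
    """Nodes whose score reaches half of the maximum in every centrality measure."""
    selected = None
    for scores in scores_by_measure.values():
        threshold = max(scores.values()) / 2
        # Assumption: a score equal to the threshold is treated as important.
        important = {node for node, score in scores.items() if score >= threshold}
        selected = important if selected is None else selected & important
    return selected or set()


# Hypothetical centrality scores for three placeholder nodes.
example_scores = {
    "DC": {"P1": 1.0, "P2": 0.6, "P3": 0.2},
    "CC": {"P1": 0.8, "P2": 0.3, "P3": 0.5},
    "PRC": {"P1": 0.05, "P2": 0.03, "P3": 0.01},
}
print(significant_nodes(example_scores))  # {'P1'}
```

Only P1 clears the threshold in all three measures: P2 falls below half of the maximum closeness, and P3 falls below half of the maximum degree and PageRank centrality.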

Table 1 Global properties of Omicron lineage variant’s network
Table 2 Global properties of Omicron network
Table 3 Centrality measure and some important score of 68 Omicron protein
Fig. 8 Protein vs. node degree

Fig. 9 Protein vs. CCo

Fig. 10 Protein vs. DC

Fig. 11 Protein vs. CC

Fig. 12 Protein vs. BC

Fig. 13 Protein vs. PR

Fig. 14 Omicron PPI network

Fig. 15 PPIN after clustering

Table 4 Generated 18 clusters from MCL algorithm

Conclusion

Centrality analysis is very useful for analyzing large biological networks. Using a candidate gene network of Omicron as a case study, we investigated and compared different centrality measures. The findings suggest that it is beneficial to explore candidate gene networks using methods from other fields of science, such as social network analysis. Graph analysis was performed on the 7 base lineages of the Omicron variant, including the 68 unique proteins encoded by the Omicron candidate genes. From the main Omicron network, we extracted 18 network clusters using the Markov clustering algorithm. The main Omicron network has 68 nodes, each of which represents a protein. Of the 68 proteins, 8 were found to be significant: AHSG, KCNK1, KCNQ1, MAPT, NR1H4, PSMC2, PTPN11, and UBE21, with the MAPT protein receiving the highest score. According to the centrality scores, the MAPT protein has the most dominant influence on the protein–protein interaction network of the Omicron candidate genes. This work can benefit medical researchers as well as the general public by allowing them to bring biological knowledge into the network analysis of the Omicron virus.

Analysis of networks can benefit greatly from centrality measures. However, these measures must be properly understood, selected, and applied. As the main part of our research, we presented the four major centrality measures that we found relevant for identifying the most significant proteins in the PPIN of the Omicron lineage variants. New and large networks are continually being created for different applications, along with different centrality measures. Most studies have tried to demonstrate the uniqueness and superiority of their own centrality measures. Much remains to be learned about how these measures differ and how to apply them properly, which is what we have tried to address here.