1 Introduction

Coronavirus Disease 2019 (COVID-19), which emerged in late 2019 and broke out in early 2020, is one of the most severe public health emergencies since the founding of the People's Republic of China. As of 18:00 on November 30, 2021, China had reported a cumulative total of 127,859 confirmed cases and 5697 deaths, with 2913 current confirmed cases remaining; globally, a cumulative total of 262,505,360 confirmed cases and 5,227,122 deaths had been reported, with 20,202,074 current confirmed cases. Figures 1 and 2 show the distribution of cumulative confirmed cases in China and globally, respectively.

Fig. 1 The cumulative number of confirmed cases in China, including cured and deaths

Fig. 2 The cumulative number of confirmed cases globally, including cured and deaths

Because of their long duration and wide spread, infectious diseases not only seriously threaten people's lives but also affect their mental health over long periods, and improper handling can easily cause social unrest. In recent years, sudden epidemic infectious diseases have continued to appear around the world. For example, SARS in 2002 spread to 32 countries and regions in a short time; as of August 16, 2003, the global death toll had reached 919, with a case fatality rate of nearly 11%. In 2009, H1N1 influenza broke out on a large scale in the United States and spread to 214 countries and regions, causing nearly 200,000 deaths. Other examples include the Ebola virus, HIV, and dengue.

At present, the number of confirmed COVID-19 cases is still increasing. Monitoring and predicting sudden epidemics in a timely and effective manner, and nipping the crisis in the bud, play an essential role in epidemic prevention and control. For these reasons, domestic and foreign scholars have carried out a series of studies on the prediction of infectious diseases. These methods can be divided into the following categories:

(1) Time series prediction model. The traditional time series prediction model is an effective tool for disease prediction; it analyzes the spreading law and development trend of an epidemic by fitting epidemic data. Zhang et al. (2016) decomposed the monthly time series data of 11 class C notifiable diseases in China from 2009 to 2014 into a seasonal index representing the seasonal pattern of infectious diseases and a linear regression model describing their long-term trend, and on this basis established an autoregressive integrated moving average (ARIMA) model to predict the incidence of these diseases. Wang et al. (2019) selected the monthly incidence data of brucellosis from 2004 to 2013 in Jinzhou city, Liaoning Province, China; they analyzed the epidemiological characteristics and prevalence situation using an ARIMA model and constructed a prediction model to predict the incidence of brucellosis from 2005 to 2014. Qi et al. (2020) collected the monthly incidence data of Hemorrhagic Fever with Renal Syndrome (HFRS) in Shandong Province from 2009 to 2018 and used the seasonal autoregressive fractionally integrated moving average (SARFIMA) model to predict the incidence of HFRS. The model evaluation showed that the SARFIMA model fitted the dynamic changes of the monthly HFRS incidence in Shandong Province well and had good fitting and predictive ability. Singh et al. (2020) used an ARIMA model with covariates to fit daily confirmed COVID-19 cases in Malaysia from 22 January to 31 March 2020, verified this prediction model with confirmed cases from 1 to 17 April 2020, and finally predicted the daily confirmed cases in Malaysia from 18 April to 1 May 2020 with satisfactory results. Aslan et al. (2022) developed a SEIQR-type deterministic model, based on a system of ordinary differential equations (ODEs), to analyze the dynamics of the epidemic in Hubei Province, and achieved accurate prediction of the local epidemic situation through the social distancing rate and the confirmed case rate.

(2) Infectious disease model. In the field of epidemic transmission research, scholars have put forward many infectious disease models that can be used to predict the number of infected people, among which the representative ones are the SIR, SIRS, and SEIR models. The main idea of this type of model is to divide the population within the epidemic range into susceptible, exposed, infectious, recovered, and other compartments, and to establish and solve differential equations based on the transitions between these groups. Guo et al. (2016) analyzed the transmission data of the Ebola virus by introducing the Jacobian matrix to solve the differential equations of the SIR model and constructed a comprehensive compartment model to fit and predict the transmission trend of the Ebola epidemic. Kuperman et al. (2001) used the SIRS model to study the transmission dynamics of infectious diseases on small-world networks and found that when network nodes change from a recovered state to a susceptible state, infectious diseases persist in the network for a long time, but the infection rate is low. Ding et al. (2004) used the SIJR model to analyze SARS epidemic data in Hong Kong, Singapore, and Canada, obtained essential parameters such as the epidemic transmission rate, and then predicted SARS transmission by adjusting the parameters of this model. Merler et al. (2011) and Wu et al. (2020) both established SEIR models that include an "exposed" compartment in the simulated population, to simulate the development trend of H1N1 in Europe and of COVID-19 in major cities of China, respectively. In addition, many scholars have extended the infectious disease model by adding or adjusting its internal structure (Wang et al. 2022a, b; Yi et al. 2021; Liao et al. 2020; Cao et al. 2020; Bi et al. 2021). Liu et al. (2020) proposed a susceptible-asymptomatic-infected-removed (SAIR) model based on a social network to describe the spread of COVID-19, analyzed the real epidemic data in Wuhan, and finally selected an appropriate method to control the spread of COVID-19. In the classical SIR model, the transmission speed is constant; in reality, however, it changes through various interventions such as personal protection, community lockdowns, and urban control. Therefore, Jia et al. (2020) added a transmission modifier to the SIR model and established a dynamic extended SIR (eSIR) model for infectious diseases, which can dynamically change the transmission speed, and used it to predict the epidemic trend of COVID-19 in Italy. Since the SEIR model does not consider the influence of random factors such as population migration and flow between regions, Yang et al. (2020) introduced input/output flows on the basis of the SEIR model, making the modified SEIR (M-SEIR) model better reflect the actual population change when used to predict an urban epidemic. Wang et al. (2022a, b) introduced a complex network model and intelligent agents on the basis of the M-SEIR model and proposed a city-level structured prediction and evaluation (CSPE) model for the COVID-19 epidemic; the model effectively deduced the epidemic situation in Hubei Province. For the COVID-19 pandemic, Manuel et al. (2020) also proposed a susceptible-exposed-symptomatic infectious-asymptomatic infectious-recovered (SEIAR) model considering population and disease mortality based on the SEIR model, which aims to reduce the number of infections by assessing the isolation situation.
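To make the compartment-model idea described above concrete, the following is a minimal sketch of the classic SIR model integrated with SciPy. The population size, initial case count, and the rates beta and gamma are illustrative assumptions only, not values taken from any of the cited studies.

```python
import numpy as np
from scipy.integrate import odeint

def sir_ode(y, t, beta, gamma):
    """Classic SIR compartment model: S -> I -> R."""
    S, I, R = y
    N = S + I + R
    dS = -beta * S * I / N              # new infections leave S
    dI = beta * S * I / N - gamma * I   # infections enter I, recoveries leave
    dR = gamma * I                      # recoveries enter R
    return [dS, dI, dR]

# Illustrative values only: population 1e6, one initial case,
# transmission rate beta = 0.3 and recovery rate gamma = 0.1 are assumptions.
N, I0 = 1_000_000, 1
y0 = [N - I0, I0, 0]
t = np.linspace(0, 160, 161)            # days
S, I, R = odeint(sir_ode, y0, t, args=(0.3, 0.1)).T
print(f"peak infections: {I.max():.0f} on day {I.argmax()}")
```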

(3) Machine learning prediction method. In recent years, many machine learning algorithms have also begun to be used for the prediction of infectious diseases. Wang et al. (2017) used the annual county-level human brucellosis (HB) cases and data on 37 environmental variables in Inner Mongolia, China, to build an artificial neural network (ANN) model, which was used to determine essential predictors of human brucellosis and predict the annual county-level cases. Chen et al. (2018) introduced different predictors for different diseases, countries, and prediction windows, established different LASSO prediction models, and analyzed the importance of the predictors under various conditions by evaluating the prediction performance of these models. The results showed that different predictors have different effects in different situations, and short-term forecasts usually perform better than long-term forecasts. Hasan (2020) proposed a hybrid model combining ensemble empirical mode decomposition (EEMD) and an artificial neural network (ANN): the model first used EEMD to decompose the epidemic time series data into intrinsic mode function (IMF) components, which were then fed into the neural network for training, and the trained network was used to predict the trend of COVID-19. Fang et al. (2020) established a random forest model based on meteorological factors to predict the incidence of infectious diarrhea in Jiangsu Province, China; the prediction model fitted the dynamic characteristics of the disease epidemic well. Gumaei et al. (2021) used gradient boosting regression (GBR) to build a training model to predict the total number of daily confirmed COVID-19 cases around the world. Xie et al. (2021) proposed a nonlinear time-varying transmission rate model based on support vector regression, in order to overcome the shortcomings of single models, namely insufficient extraction of effective information and low prediction accuracy, and applied it to the long-term prediction of the COVID-19 epidemic in Hubei Province. Smith and Alvarez (2021) took COVID-19 patients in Wuhan, Hubei, China, as the research object and applied a series of machine learning models to analyze the factors affecting the death of COVID-19 patients, concluding that Shapley values are helpful for identifying the mortality factors of COVID-19.

(4) Deep learning prediction method. In recent years, deep learning methods have performed well in fields such as data mining, analysis, and prediction, and have provided new techniques for studying epidemic transmission. Chae et al. (2018) used search query data, social media big data, and weather data to sort out the factors that affect the occurrence of infectious diseases, built OLS, ARIMA, DNN, and LSTM prediction models, and compared their prediction performance; the study found that deep learning models are more suitable for predicting the trend of infectious diseases. Wang et al. (2020) used an LSTM network with a rolling update mechanism to continuously feed new prediction results into the next iteration of model training, and finally made long-term predictions of epidemic trends in Russia, Peru, and Iran; the authors also introduced the Diffusion Index (DI) to evaluate the effectiveness of preventive measures such as social isolation and lockdown. Hao et al. (2020) used an Elman neural network, LSTM, and SVM to predict the cumulative confirmed cases, deaths, and cured cases in Wuhan, Hubei Province, used SVM with fuzzy granulation to predict new confirmed cases, new deaths, and new cured cases, and finally analyzed the applicability of the different methods. Rauf et al. (2021) proposed using long short-term memory (LSTM) networks, recurrent neural networks (RNN), and gated recurrent units (GRU), while considering time variables and data nonlinearity, to evaluate each model and predict the number of confirmed COVID-19 cases in the Asia-Pacific region over the next 10 days. Chen et al. (2021) developed a data-driven workflow to extract, process, and develop deep learning methods, and built multivariate LSTM models with different architectures to predict the COVID-19 epidemic, thereby supplementing existing modeling methods.

In addition, when predicting the incidence trend of a specific infectious disease, not all features should be included in the prediction model; sometimes better prediction results can be obtained by deleting unimportant features. The meta-heuristic algorithm is a common feature selection method that has attracted wide attention from scholars (Jia et al. 2019; Turabieh et al. 2021; Samieiyan et al. 2022). Depending on the source of inspiration, meta-heuristic algorithms can be divided into four categories: (1) biological evolutionary algorithms that simulate the process and mechanism of biological evolution in nature, of which the most widely used is the genetic algorithm (GA) (Holland 1992) with its basic operations of selection, crossover, and mutation; others include differential evolution (DE), evolutionary strategy (ES), etc.; (2) algorithms inspired by physical phenomena and laws in nature, such as the multiverse optimizer (MVO), inspired by three concepts in cosmology (white holes, black holes, and wormholes) and simulating the transfer of matter in the universe from white holes to black holes through wormholes (Mirjalili et al. 2015), the gravitational search algorithm (GSA) based on Newton's law of universal gravitation (Rashedi et al. 2009), and the lightning search algorithm (LSA) based on the lightning mechanism (Shareef et al. 2015); (3) algorithms that simulate various human behaviors in society, such as teaching–learning-based optimization (TLBO), which simulates the "teaching" and "learning" stages of the learning process to improve the ability of each individual (Rao et al. 2011), and the heap-based optimizer (HBO), which uses the heap data structure to map a corporate rank hierarchy and simulate three human behaviors (Askari et al. 2020); (4) swarm intelligence algorithms that simulate the social behavior of animal groups in nature, such as particle swarm optimization (PSO), designed to simulate the predatory behavior of birds (Eberhart and Kennedy 1995), the whale optimization algorithm (WOA), proposed by imitating the hunting behavior of humpback whales (Mirjalili and Lewis 2016), the chimp optimization algorithm (ChOA), which models the hunting behavior of chimps according to differences in intelligence and sexual motivation (Khishe and Mosavi 2020), and the recently proposed marine predators algorithm (MPA) (Faramarzi et al. 2020) and k-means clustering optimizer (KO) (Minh et al. 2022). The principles of these algorithms differ, as do their abilities to solve different types of problems. As the "no free lunch" (NFL) theorem states, no algorithm can solve all types of optimization problems with equal simplicity and efficiency (Wolpert and Macready 1997). Therefore, we think it is necessary to develop new optimization algorithms to better achieve the purpose of feature selection.

In conclusion, many scholars have made great efforts in infectious disease prediction including COVID-19 prediction, and the results have been fruitful. However, by analyzing the current situation of infectious disease prediction research, we found that the existing models still have the following problems:

  • (1) Some models are constructed based on the historical data of a single region, ignoring the spatial correlation of infectious disease transmission, which affects the improvement of the model prediction accuracy;

  • (2) Some prediction models only use directly related data, such as the number of confirmed cases and the number of isolated people, so the prediction results cannot reflect in time the influence of other factors (such as the social environment and government response policies) on the transmission of infectious diseases, making it difficult to achieve satisfactory prediction accuracy;

  • (3) Some models do not consider the time lag effect of the incubation period on transmission. A viral infection can have an incubation period (the time from contact with the source of transmission to the onset of symptoms), and some patients remain asymptomatic. Ignoring this time lag also degrades prediction performance;

  • (4) Finally, when many feature data are input into the model, some of them may be irrelevant. If these features are incorporated into the prediction model without selection, they will not only increase the computational cost but also degrade the prediction performance of the model.

To solve the above problems, this paper proposes a graph convolutional network prediction model based on the lioness optimization algorithm (LsOA-GCN), designed according to the transmission characteristics of COVID-19. The main contributions are as follows:

  • (1) A novel population-based lioness optimization algorithm was proposed to filter the feature data with time delay (Baidu index), and based on this, a feature matrix for epidemic prediction was constructed;

  • (2) A COVID-19 prediction model, LsOA-GCN, which comprehensively uses the lioness optimization algorithm and graph convolutional network (GCN) (Kipf and Welling 2017), was proposed;

  • (3) Experiments were conducted on epidemic data from 17 regions in Hubei Province, China, and compared with 10 prediction methods, which verified the superiority of the LsOA-GCN prediction model.

The rest of this paper is organized as follows. The second part introduces the LsOA-GCN prediction model in detail, including the problem definition, adjacency matrix construction, data sources, and model construction. The third part compares and analyzes the performance of different prediction methods through model evaluation indicators and statistical tests. The last part summarizes the proposed prediction model.

2 LsOA-GCN model

2.1 Problem definition

(1) City network graph: \(G\). In this paper, \(G = (V, E)\) is used to describe the topological structure of Hubei Province's city network. We regard each region in Hubei Province as a node in the network graph, where \(V = \{ v_{1} , v_{2} , ..., v_{n} \}\) represents the set of city nodes in Hubei Province, \(n\) is the total number of nodes, and \(E\) represents the set of edges between cities.

(2) Adjacency matrix: \(A^{n \times n}\). The spatial information between cities is mapped into the city network graph \(G\), and the degree of correlation between city nodes is represented by edges. This paper intends to use the migration relationship graph to reflect the degree of mutual influence between various regions in Hubei Province, which is represented by \(G = (V, A)\), where \(A \in R^{n \times n}\) is the adjacency matrix.

(3) Feature matrix: \(X^{n \times t \times d}\). We use the Baidu Search Index values of keywords related to the epidemic as the attribute features of the nodes in the network, represented by \(X \in R^{n \times t \times d}\), where \(t\) is the number of days and \(d\) is the dimension of the node attribute features. For example, \(X_{t} \in R^{n \times d}\) represents the Baidu Search Index values of \(d\) keywords in \(n\) regions on the \(t\)-th day.

Therefore, the epidemic prediction problem solved in this paper can be expressed as follows: given the city network topology graph \(G\) and the feature matrix \(X\), predict the cumulative number of confirmed cases in each region of Hubei Province in the future through a graph convolutional network. The formula is as follows:

$$ \left[ {Y_{t + 1} , \ldots , Y_{t + T} } \right] = GCN\left( {G;\left( {X_{t - s} , \ldots , X_{t - 1} , X_{t} } \right)} \right) $$
(1)

where \(s\) is the length of the historical time series, \(T\) is the length of the future time series that needs to be predicted, \(Y_{t + 1}\) represents the cumulative number of confirmed cases in each region of Hubei Province on the (\(t\) + 1)-th day, and \(X_{t}\) represents the feature vector of each region on day \(t\).

2.2 Construction of adjacency matrix

Since COVID-19 is characterized by person-to-person transmission, this paper assumes that the large-scale population movement during the epidemic has a certain impact on its spread. COVID-19 also has an incubation period, during which some patients have only mild respiratory symptoms, such as fever, fatigue, and dry cough, but are nevertheless infectious, and some patients are entirely asymptomatic. Based on this, this paper collected the population migration index values of 17 regions in Hubei Province from January 10, 2020 to March 15, 2020 (a total of 66 days) and performed a time-delayed Spearman correlation analysis with the cumulative confirmed cases (from January 24, 2020 to March 29, 2020) in each region to construct the adjacency matrix.

$$ A_{ij} = {\text{ave}}\left| {{\text{spearman}}\left( {x_{i} , y_{ji} } \right)} \right|\quad i, j \in [1, 17] $$
(2)

where the Spearman correlation coefficient \(\rho = \frac{{\mathop \sum \limits_{i, j} \left( {x_{i} - \overline{x}} \right)\left( {y_{ji} - \overline{y}} \right)}}{{\sqrt {\mathop \sum \limits_{i} \left( {x_{i} - \overline{x}} \right)^{2} \mathop \sum \limits_{i, j} \left( {y_{ji} - \overline{y}} \right)^{2} } }}\), \(x_{i}\) represents the cumulative number of confirmed cases in region \(i\), and \(y_{ji}\) represents the population migration index from region \(j\) to region \(i\).
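As a rough illustration of Eq. (2), the sketch below computes a time-delayed Spearman adjacency matrix with SciPy. The array layout, the set of candidate lags, and the choice to average the absolute correlation over those lags are our assumptions about the data preparation, not the authors' code.

```python
import numpy as np
from scipy.stats import spearmanr

def build_adjacency(cases, migration, lags=range(0, 15)):
    """Adjacency matrix from time-delayed Spearman correlations (cf. Eq. (2)).

    cases:     array (n_regions, n_days)            - cumulative confirmed cases x_i
    migration: array (n_regions, n_regions, n_days) - migration index y_ji from
               region j to region i, aligned to the same calendar days (assumption)
    lags:      candidate lead times in days of migration over cases (assumption)
    """
    n = cases.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            vals = []
            for lag in lags:
                x = cases[i, lag:]                 # cases `lag` days later
                y = migration[j, i, :len(x)]       # earlier migration j -> i
                rho, _ = spearmanr(x, y)
                vals.append(abs(rho))
            A[i, j] = np.mean(vals)                # ave |spearman(x_i, y_ji)|
    return A
```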

2.3 Data source

The epidemic data in this paper come from the daily epidemic notifications provided by the National Health Commission (http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml) and the COVID-19 situation reports for Hubei Province provided by the Hubei Provincial Health Commission (https://wjw.hubei.gov.cn/); from these sources, we collected the cumulative number of confirmed COVID-19 cases in 17 regions of Hubei Province from January 24, 2020, to March 29, 2020.

The migration data used to construct the adjacency matrix come from Baidu Maps Huiyan (http://qianxi.baidu.com/), developed by Baidu; we collected and processed the migration scale index of 17 regions in Hubei Province from January 10, 2020 to March 15, 2020. The keyword data used to construct the feature matrix are derived from the Baidu Search Index (https://index.baidu.com/v2/index.html#/), also provided by Baidu. The location parameter was set to "cities in Hubei Province", the time parameter to "2020-01-10 ~ 2020-03-15", and the source parameter to "PC + Mobile".

2.4 Model construction

2.4.1 Overall overview

In this section, we give a comprehensive overview of how the LsOA-GCN model is used for epidemic prediction. The model is mainly composed of two parts: the construction of the feature matrix, which captures temporal information, and the graph convolution operation, which captures spatial information. First, the keyword data were sorted through a time-lagged Spearman correlation analysis, and the initial feature matrix was constructed to capture the time series relationship between the current number of confirmed cases and the Baidu Search Index at different periods in the past. Then, LsOA was used to further filter the feature matrix to improve the accuracy of the model and avoid interference from unnecessary features. Finally, the feature matrix was used as input to the graph convolutional network, which extracts the spatial features of the epidemic-related data and produces the prediction results.

2.4.2 Capture temporal information

In the information society, with the continuous innovation and development of PC and mobile terminals, the Internet has become the primary and critical way for people to collect information from the outside world. Netizens can use various information search platforms to gather the information they want and obtain real-time data on the network. According to the 47th "Statistical Report on Internet Development in China" released by the China Internet Network Information Center (CNNIC), as of December 2020 there were 989 million Internet users in China, an increase of 85.4 million compared with March 2020; the Internet has become an essential channel for the public to obtain information. Among the many search platforms, Baidu is the most popular, used by nearly 90% of Internet users in China and accounting for a significant market share. Baidu Index is a data sharing platform based on the behavior data of the massive number of Baidu users. Through the Baidu Index, we can study keyword search trends, gain insight into the interests and needs of netizens, monitor public opinion trends, and locate audience features. In recent years, the Baidu Search Index has been used to predict epidemic diseases such as dengue (Li et al. 2017), HIV/AIDS (Huang et al. 2020) and brucellosis (Zhao et al. 2020a). Therefore, it is scientific and feasible to use the Baidu Search Index as a feature to predict COVID-19.

During the epidemic, there is a complicated time series correlation between the current number of confirmed cases and the Baidu Search Index at different periods in the past. Therefore, obtaining the temporal features is also a key issue in epidemic prediction. The first step is the selection of keywords. Selection methods include empirical word selection, technical word selection (Ginsberg et al. 2009), and range word selection (Alessa and Faezipour 2018). The keyword selection used to construct the feature matrix mainly adopted the range word selection method, supplemented by empirical word selection. Firstly, the selection range of keywords was roughly determined according to the features of the epidemic, its transmission routes, protective measures, etc. Then, related words were recommended through the demand graph function in Baidu Index, which reduces the workload while ensuring, as far as possible, the completeness of the epidemic-related vocabulary. Based on this, we set up a keyword database covering four dimensions of COVID-19 (prevention, symptoms, common words, and other related aspects) and initially selected 40 specific keywords, namely "vaccine", "hand washing", "mask", "disinfection", "nucleic acid", "flu", "diarrhea", "dry cough", "pneumonia", "cough", "fever", "dyspnea", "high temperature", "stuffy nose", "asymptomatic infection", "close contact", "suspected case", "aerosol transmission", "infection", "novel coronavirus", "NCP", "SARS-CoV-2", "COVID-19", "novel coronary pneumonia", "novel coronavirus pneumonia", "virus pneumonia", "coronavirus", "novel", "epidemic", "Center for Disease Control and Prevention", "Red Cross Society", "Spring Festival", "Health Commission", "SARS", "Zhong Nanshan", "children", "medical", "government", "Wuhan" and "lockdown of the city".

The specific construction steps of the feature matrix are as follows:

(1) Establish a keyword database from the 4 dimensions mentioned above, and initially screen out 40 keywords;

(2) Time-delayed Spearman correlation analysis is performed between Baidu Index (search index, information index) of 40 keywords and the cumulative number of COVID-19 cases;

(3) Further screen the keyword database and select keyword data with a correlation coefficient greater than 0.5;

(4) Collect keyword data according to the delay days corresponding to the maximum correlation coefficient of each keyword in the thesaurus, and construct a feature matrix based on this;

(5) When dealing with practical problems, not all features must be included in the prediction model, and sometimes deleting unimportant features can improve model performance. Feature selection is a common dimensionality reduction method that has attracted wide attention from scholars, who usually treat it as a 0–1 problem and solve it with meta-heuristic algorithms (Guo et al. 2022; Jia et al. 2022; Sha et al. 2020). Inspired by this, we used a swarm intelligence optimization algorithm to further filter the feature matrix; its main purpose is to delete feature data irrelevant to the problem from the original feature matrix, thereby reducing the feature dimension and the running time. This is explained in detail in the next subsection, and a small sketch of steps (2)–(4) above is given below.
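The sketch below illustrates steps (2)–(4), assuming the Baidu Index series are columns of a pandas DataFrame aligned with a daily cumulative-case series; the 0–14-day lag range and the 0.5 threshold follow the text, while all function and variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def best_lag_and_corr(index_series, case_series, max_lag=14):
    """Return the lag (0..max_lag days) at which the Baidu Index series best
    explains the cumulative-case series, and the corresponding correlation."""
    x = np.asarray(index_series, dtype=float)
    y = np.asarray(case_series, dtype=float)
    scores = {}
    for lag in range(max_lag + 1):
        x_lead = x[:-lag] if lag else x      # Baidu Index leading by `lag` days
        rho, _ = spearmanr(x_lead, y[lag:])  # vs. the later case counts
        scores[lag] = rho
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]

def build_feature_matrix(baidu: pd.DataFrame, cases: pd.Series, threshold=0.5):
    """Keep keywords whose peak |rho| exceeds the threshold and shift each
    kept series by its own optimal delay (steps (2)-(4) in the text)."""
    kept = {}
    for kw in baidu.columns:
        lag, rho = best_lag_and_corr(baidu[kw], cases)
        if abs(rho) > threshold:
            kept[kw] = baidu[kw].shift(lag)  # align the index series to case dates
    return pd.DataFrame(kept)
```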

2.4.3 Filter feature matrix

In recent years, various intelligent optimization algorithms have emerged one after another, such as the classic particle swarm optimization (Eberhart and Kennedy 1995) and whale optimization algorithm (Mirjalili and Lewis 2016), and the newly developed marine predators algorithm (Faramarzi et al. 2020). However, when dealing with practical problems, these algorithms may converge prematurely and easily fall into local optima, making it difficult for any single algorithm to solve all practical application problems (Faramarzi et al. 2020).

Therefore, we proposed a population-based meta-heuristic optimization algorithm, the Lioness Optimization Algorithm (LsOA), to further filter the feature matrix and avoid the interference of unnecessary features. Several lion-inspired optimization algorithms already exist. The first is the Lion's Algorithm (Rajakumar 2012), which mainly simulates three behaviors (lion mating, territory takeover, and territory defense) to find the optimal solution. The second is the Lion Pride Optimizer (Wang et al. 2012), which is developed around the mating behavior of lions and, on this basis, proposes three evolutionary strategies with survival of the fittest as its main idea. The third is the Lion Optimization Algorithm (Yazdani and Jolai 2016), which not only draws on the mating and competition behaviors of lions but also introduces hunting, migration, patrol, and other mechanisms. Although the evolution processes of these algorithms incorporate the natural hunting and mating behaviors of lions, they fail to consider some of the core optimization mechanisms in the lion hunting process, which leads to shortcomings such as slow convergence and difficulty in finding the global optimum. Aiming at these shortcomings, this paper proposes LsOA. In a pride, the lionesses are the main performers of the hunt; inspired by this, the algorithm focuses on the hunting behavior of lionesses and comprehensively considers two mechanisms, team hunting and elite hunting, while also introducing Levy flight and Brownian motion. While retaining the strong global exploration ability of the earlier lion-pride algorithms, it improves the local exploitation ability, so that it can solve various mathematical optimization problems more effectively and differs substantially from the previous algorithms.

In LsOA, the prey is the global optimal solution, and the current optimal solution is considered to be lioness A or the top lioness. All lions mentioned below are candidate solutions, and they are all lionesses by default. LsOA is mainly composed of the following six mechanisms.

(1) Track prey.

In the hunting process, the lions tend to spread out into a fan shape to surround the prey in the center, and the group of lions in the “center” of the fan shape will consciously adjust their positions according to the positions of other lions and prey. The mathematical model of this behavior is shown in Eqs. (3) and (4), where Eq. (3) defines the step length and direction of the current lion group to the target position, and Eq. (4) defines the position of the lion group after the update.

$$ \vec{D} = \left| {\vec{C} \cdot \overrightarrow {{\text{Prey}}} - \vec{X}(t)} \right| $$
(3)
$$ \vec{X}(t + 1) = \overrightarrow {{\text{Prey}}} - \vec{A} \cdot \vec{D} $$
(4)

where \(t\) is the current iteration number, \(\vec{X}(t)\) is the position vector of the lioness, \(\vec{D}\) is the distance between the current lioness and the prey, \(\cdot\) denotes the element-wise product, \(\overrightarrow {{{\text{Prey}}}}\) is the position vector of the prey, and \(\vec{A}\) and \(\vec{C}\) are coefficient vectors calculated as follows:

$$ \vec{A} = 2\vec{a} \cdot \vec{r}_{1} - \vec{a} $$
(5)
$$ \vec{C} = 2\vec{r}_{2} $$
(6)

where \(\vec{a} = 2 - \frac{{2 \cdot {\text{Iter}}}}{{{\text{Max}}\_{\text{iter}}}} \in [0, 2]\). To simulate approaching the prey, the value of \(\vec{a}\) decreases linearly from 2 to 0 as the iteration progresses, and \(\vec{A} \in \left[ { - \vec{a}, \vec{a}} \right]\). If \(|\vec{A}| < 1\), the lioness attacks the prey and the algorithm focuses on exploitation; if \(|\vec{A}| > 1\), the lioness is forced to move away from the prey and the algorithm focuses on exploration. \(\vec{C} \in \left[ {0, 2} \right]\) is a random value that represents the weight of the influence of the current optimal solution's position on the updated position. \(\vec{r}_{1}\) and \(\vec{r}_{2}\) are random vectors in [0, 1].
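The following is a compact NumPy sketch of the "track prey" update in Eqs. (3)–(6); the vectorised layout of the positions and the function name are illustrative assumptions.

```python
import numpy as np

def track_prey(X, prey, it, max_iter, rng=np.random.default_rng()):
    """One 'track prey' update per Eqs. (3)-(6).

    X:    (n, d) current lioness positions
    prey: (d,)   position of the current best solution (the prey)
    """
    a = 2 - 2 * it / max_iter        # Eq. (5): a decreases linearly from 2 to 0
    r1 = rng.random(X.shape)
    r2 = rng.random(X.shape)
    A = 2 * a * r1 - a               # Eq. (5)
    C = 2 * r2                       # Eq. (6)
    D = np.abs(C * prey - X)         # Eq. (3): distance to the prey
    return prey - A * D              # Eq. (4): updated positions
```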

(2) Capture prey.

When capturing prey, we assume that the target prey is located near the "central circle" of the hunting team. This central circle is composed of the four best candidate solutions in the population (lioness A, lioness B, lioness C, and lioness D) and their average. The mathematical model is as follows:

$$ \begin{gathered} \vec{D}_{A} = \left| {\vec{C}_{{1}} \cdot \vec{X}_{A} - \vec{X}} \right|, \vec{D}_{B} = \left| {\vec{C}_{2} \cdot \vec{X}_{B} - \vec{X}} \right| \hfill \\ \vec{D}_{C} = \left| {\vec{C}_{3} \cdot \vec{X}_{C} - \vec{X}} \right|, \vec{D}_{D} = \left| {\vec{C}_{4} \cdot \vec{X}_{D} - \vec{X}} \right| \hfill \\ \end{gathered} $$
(7)
$$ \begin{gathered} \vec{X}_{a} = \vec{X}_{A} - \vec{A}_{1} \cdot \left( {\vec{D}_{A} } \right), \vec{X}_{b} = \vec{X}_{B} - \vec{A}_{2} \cdot \left( {\vec{D}_{B} } \right) \hfill \\ \vec{X}_{c} = \vec{X}_{C} - \vec{A}_{3} \cdot \left( {\vec{D}_{C} } \right), \vec{X}_{d} = \vec{X}_{D} - \vec{A}_{4} \cdot \left( {\vec{D}_{D} } \right) \hfill \\ \end{gathered} $$
(8)
$$ \vec{X}_{ave} = \frac{{\vec{X}_{a} + \vec{X}_{b} + \vec{X}_{c} + \vec{X}_{d} }}{4} $$
(9)

where \(A_{i}\) and \(C_{i}\) (i = 1, 2, 3, 4) are calculated as in Eqs. (5) and (6), \(\vec{X}_{A} , \vec{X}_{B}, \vec{X}_{C}, \vec{X}_{D}\) are the positions of lioness A, lioness B, lioness C and lioness D, respectively, \(\vec{X}_{ave}\) is the average of \(\vec{X}_{a}, \vec{X}_{b}, \vec{X}_{c}, \vec{X}_{d}\), and \(\vec{X}\) is the position of the current prey.
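Under the same assumptions, Eqs. (7)–(9) can be sketched as follows, producing the five candidate positions that later form the "central circle"; the helper name and argument layout are illustrative.

```python
import numpy as np

def capture_prey_candidates(X, X_A, X_B, X_C, X_D, a, rng=np.random.default_rng()):
    """Candidate positions of the 'central circle' per Eqs. (7)-(9).

    X is the position of the current search agent; X_A..X_D are the four
    best lionesses found so far (all 1-D arrays of length d).
    """
    def move_towards(leader):
        r1, r2 = rng.random(leader.shape), rng.random(leader.shape)
        A, C = 2 * a * r1 - a, 2 * r2      # Eqs. (5)-(6)
        D = np.abs(C * leader - X)         # Eq. (7)
        return leader - A * D              # Eq. (8)

    X_a, X_b, X_c, X_d = (move_towards(v) for v in (X_A, X_B, X_C, X_D))
    X_ave = (X_a + X_b + X_c + X_d) / 4    # Eq. (9)
    return X_a, X_b, X_c, X_d, X_ave
```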

(3) “Central Circle” and Elite Matrix.

  • “Central Circle”:

According to Eqs. (8) and (9), the positions of the lions at the center of the hunting team can be determined. They are updated based on the positions of the four lionesses closest to the prey in the current population, namely \( \vec{X}_{a} , \vec{X}_{b}, \vec{X}_{c}, \vec{X}_{d}\), and the last candidate is their average \(\vec{X}_{ave}\). These five positions constitute the hunting team's "central circle":

$$ {\text{E\_Team}} = \left[ {\vec{X}_{a} ;\vec{X}_{b} ;\vec{X}_{c} ;\vec{X}_{d} ;\vec{X}_{ave} } \right] $$
(10)

Figure 3 shows the construction process of the “central circle”. Since the position of the target prey in the search space is not known a priori, this article assumes that the position of the target prey may be located near any of these five positions, and the probability of the five candidates being selected is the same, which is 0.2.

Fig. 3 The construction process of the "central circle"

  • Elite Matrix:

The elite matrix is designed based on the elite hunting mechanism. In this mechanism, the population acts as agents of the elite: these agents continuously probe the direction and position indicated by the elite, and finally achieve the goal of capturing the prey.

In designing the elite hunting mechanism, this paper does not rely on a single lioness to carry out the optimization iterations. Instead, taking elite hunting as inspiration and combining it with the swarm intelligence mechanism, the position of the elite is determined from a group of lionesses of a specific size according to the results of the population iteration, and the optimization iteration is then carried out under the guidance of the group search mechanism.

Figure 4 shows the construction process of the elite matrix. To avoid the risk of falling into a local optimum caused by relying on a single elite position, we draw on the principle of triangular stability and determine the elite position not from the best solution of the last round, as is common, but jointly from the top three agent positions by last-round fitness value (lioness A, lioness B, and lioness C). The calculation formula is as follows:

$$ {\text{Top\_lioness\_pos}} = \frac{{\vec{X}_{A} + \vec{X}_{B} + \vec{X}_{C} }}{3} $$
(11)

where \( \vec{X}_{A} , \vec{X}_{B}, \vec{X}_{C}\) are the position vectors of lioness A, lioness B and lioness C, respectively. After the position of the top lioness is determined, copy it \(n\) times to build the elite matrix:

$$ {\text{Elite\_lioness}} = \left[ {\begin{array}{*{20}c} {X_{1, 1}^{I} } & {X_{1, 2}^{I} } & \cdots & {X_{1, d}^{I} } \\ {X_{2, 1}^{I} } & {X_{2, 2}^{I} } & \cdots & {X_{2, d}^{I} } \\ \vdots & \vdots & \ddots & \vdots \\ {X_{n, 1}^{I} } & {X_{n, 2}^{I} } & \cdots & {X_{n, d}^{I} } \\ \end{array} } \right]_{n \times d} $$
(12)

where \({ }\overrightarrow {{X^{I} }}\) is the position vector of the top lioness in the lion group.

Fig. 4 The construction process of the elite matrix
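A small sketch of Eqs. (11)–(12), assuming minimisation (lower fitness is better) and NumPy arrays for the population; the names are illustrative.

```python
import numpy as np

def build_elite_matrix(fitness, positions, n):
    """Top-lioness position (Eq. (11)) tiled n times into the elite matrix (Eq. (12)).

    fitness:   (pop,)   fitness values from the last round (lower is better here)
    positions: (pop, d) corresponding positions
    """
    top3 = np.argsort(fitness)[:3]                   # lioness A, B, C
    top_lioness_pos = positions[top3].mean(axis=0)   # Eq. (11): joint decision
    return np.tile(top_lioness_pos, (n, 1))          # Eq. (12): n x d elite matrix
```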

(4) The march strategy.

As the number of optimization iterations increases, so does the risk of the lions falling into a local optimum. To enable the lions to quickly explore a new area, this paper designed the march strategy: when \(r2 \le M\), all dimensions of a prey individual are updated together from a single randomly selected row of \(E\_Team\) (march); otherwise, each dimension randomly selects its value from \(E\_Team\) (random). The formulas are as follows:

$$ {\text{Prey}}(i, j) = E\_Team(randi(size(E\_Team, 1)), j),\quad {\text{if}}\;(r2 > M) $$
(13)
$$ {\text{Prey}}(i, :) = E\_Team(randi(size(E\_Team, 1)), :),\quad {\text{if}}\;(r2 \le M) $$
(14)

where \(randi()\) is used to generate pseudo-random integers; size() is used to return the size of the vector. M indicates the degree of influence of the march strategy on the process. The prey here is a matrix with the same dimension as the elite matrix.
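A sketch of the march strategy in Eqs. (13)–(14) is shown below. Since the text does not state the value of \(M\) or how \(r2\) is drawn, both are treated here as assumptions (\(M = 0.5\), \(r2\) uniform in [0, 1]).

```python
import numpy as np

def march_step(prey, e_team, M=0.5, rng=np.random.default_rng()):
    """March strategy of Eqs. (13)-(14), applied to every prey individual.

    prey:   (n, d) prey matrix
    e_team: (5, d) the "central circle" E_Team
    M:      assumed influence probability of the march strategy
    """
    n, d = prey.shape
    out = prey.copy()
    for i in range(n):
        if rng.random() <= M:                              # march: one shared row
            out[i, :] = e_team[rng.integers(e_team.shape[0])]
        else:                                              # random: per-dimension pick
            rows = rng.integers(e_team.shape[0], size=d)
            out[i, :] = e_team[rows, np.arange(d)]
    return out
```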

(5) Phase-focused mechanism.

To avoid premature convergence of LsOA, and with reference to the behavioral process of lion hunting, the whole iteration process is divided into three stages, namely the early, middle, and late iterations (Faramarzi et al. 2020), and a different search mechanism is used in each stage.

① the early iteration.

While Iter < \(\frac{{1}}{{3}}\) Max_iter

$$ \begin{gathered} \overrightarrow {{{\text{step}}}}_{i} = \overrightarrow {{{\text{RB}}}} \otimes \left( {\overrightarrow {{{\text{Elite\_lioness}}}}_{i} - \overrightarrow {{{\text{RB}}}} \otimes \overrightarrow {{{\text{Prey}}}}_{i} } \right),\quad i = 1, \ldots , n \hfill \\ \overrightarrow {{{\text{Prey}}}}_{i} = \overrightarrow {{{\text{Prey}}}}_{i} + P \cdot \vec{R} \otimes \overrightarrow {{{\text{step}}}}_{i} \hfill \\ \end{gathered} $$
(15)

where Iter is the current iteration number and Max_iter is the maximum number of iterations. \(\overrightarrow {{{\text{RB}}}}\) is a vector of random numbers representing Brownian motion, and the symbol \(\otimes\) denotes element-wise multiplication. \(P = 0.5\), and \(\vec{R}\) is a random vector in [0, 1].

② the middle iteration.

While \(\frac{{1}}{{3}}\) Max_iter < Iter < \(\frac{2}{3}\) Max_iter.

For prey:

$$ \begin{gathered} \overrightarrow {{{\text{step}}}}_{i} = \overrightarrow {{{\text{RL}}}} \otimes \left( {\overrightarrow {{{\text{Elite\_lioness}}}}_{i} - \overrightarrow {{{\text{RL}}}} \otimes \overrightarrow {{{\text{Prey}}}}_{i} } \right),\quad i = 1, \ldots , n/2 \hfill \\ \overrightarrow {{{\text{Prey}}}}_{i} = \overrightarrow {{{\text{Prey}}}}_{i} + P \cdot \vec{R} \otimes \overrightarrow {{{\text{step}}}}_{i} \hfill \\ \end{gathered} $$
(16)

where \(\overrightarrow {{{\text{RL}}}}\) is a vector of random numbers representing Levy flight, and \(\overrightarrow {{{\text{RL}}}} \otimes \overrightarrow {{{\text{Prey}}}}_{i}\) simulates the movement of the prey. For the lions, this study assumed:

$$ \begin{gathered} \overrightarrow {{{\text{step}}}}_{i} = \overrightarrow {{{\text{RB}}}} \otimes \left( {\overrightarrow {{{\text{RB}}}} \otimes \overrightarrow {{{\text{Elite\_lioness}}}}_{i} - \overrightarrow {{{\text{Prey}}}}_{i} } \right),\quad i = n/2, \ldots , n \hfill \\ \overrightarrow {{{\text{Prey}}}}_{i} = \overrightarrow {{{\text{Elite\_lioness}}}}_{i} + P \cdot CF \otimes \overrightarrow {{{\text{step}}}}_{i} \hfill \\ \end{gathered} $$
(17)

where \({\text{CF}} = \left( {1 - \frac{{{\text{Iter}}}}{{{\text{Max\_iter}}}}} \right)^{{\left( {2\frac{{{\text{Iter}}}}{{{\text{Max\_iter}}}}} \right)}}\) is an adaptive parameter that controls the step length of the lion's movement, and \(\overrightarrow {{{\text{RB}}}} \otimes \overrightarrow {{{\text{Elite\_lioness}}}}_{i}\) simulates the movement of the predator (lioness).

③ the late iteration.

While Iter > \(\frac{2}{{3}}\) Max_iter

$$ \begin{gathered} \overrightarrow {{{\text{step}}}}_{i} = \overrightarrow {{{\text{RL}}}} \otimes \left( {\overrightarrow {{{\text{RL}}}} \otimes \overrightarrow {{{\text{Elite\_lioness}}}}_{i} - \overrightarrow {{{\text{Prey}}}}_{i} } \right),\quad i = 1, \ldots , n \hfill \\ \overrightarrow {{{\text{Prey}}}}_{i} = \overrightarrow {{{\text{Elite\_lioness}}}}_{i} + P \cdot CF \otimes \overrightarrow {{{\text{step}}}}_{i} \hfill \\ \end{gathered} $$
(18)

where \(\overrightarrow {RL} \otimes \overrightarrow {{{\text{Elite}}_{ - } {\text{lioness}}}}_{i}\) simulates the behavior of a predator (lioness).
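Putting the three stages together, the sketch below implements Eqs. (15)–(18) in NumPy. The Brownian steps are drawn from a standard normal distribution and the Levy steps from a Mantegna-style generator; both generators, like the function names, are our assumptions, as the paper does not specify them.

```python
import numpy as np
from math import gamma, pi, sin

def levy(shape, beta=1.5, rng=np.random.default_rng()):
    """Mantegna-style Levy-flight step (an assumed implementation of RL)."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    return rng.normal(0, sigma, shape) / np.abs(rng.normal(0, 1, shape)) ** (1 / beta)

def phase_update(prey, elite, it, max_iter, P=0.5, rng=np.random.default_rng()):
    """Phase-focused updates of Eqs. (15)-(18) for one iteration."""
    n, d = prey.shape
    R = rng.random((n, d))
    RB = rng.normal(size=(n, d))                      # Brownian motion
    RL = levy((n, d), rng=rng)                        # Levy flight
    CF = (1 - it / max_iter) ** (2 * it / max_iter)   # adaptive step control
    new = prey.copy()
    if it < max_iter / 3:                             # early iteration: Eq. (15)
        step = RB * (elite - RB * prey)
        new = prey + P * R * step
    elif it < 2 * max_iter / 3:                       # middle iteration: Eqs. (16)-(17)
        half = n // 2
        step1 = RL[:half] * (elite[:half] - RL[:half] * prey[:half])
        new[:half] = prey[:half] + P * R[:half] * step1
        step2 = RB[half:] * (RB[half:] * elite[half:] - prey[half:])
        new[half:] = elite[half:] + P * CF * step2
    else:                                             # late iteration: Eq. (18)
        step = RL * (RL * elite - prey)
        new = elite + P * CF * step
    return new
```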

(6) The impact of FADs.

FADs (fish aggregating devices) (Filmalter et al. 2011) are raft-like structures, usually made of waste materials, that easily attract and gather large numbers of fish and make it difficult for them to escape. Similar regions exist on the African savannah, such as wetlands or swamps: when lions hunt there, they often get stuck and cannot get out.

Given this, this paper introduces FADs into the algorithm. FADs are treated as local optima, and their effect is that lions become trapped in these areas. The FADs effect is expressed mathematically as:

$$ \overrightarrow {{{\text{Prey}}}}_{i} = \left\{ {\begin{array}{*{20}l} {\overrightarrow {{{\text{Prey}}}}_{i} + CF\left[ {\vec{X}_{\min } + \vec{R} \otimes \left( {\vec{X}_{\max } - \vec{X}_{\min } } \right)} \right] \otimes Ma} \hfill & {{\text{ if }}r \le FADs} \hfill \\ {\overrightarrow {{{\text{Prey}}}}_{i} + [FADs(1 - r) + r]\left( {\overrightarrow {{{\text{Prey}}}}_{r1} - \overrightarrow {{{\text{Prey}}}}_{r2} } \right)} \hfill & {{\text{ if }}r > FADs} \hfill \\ \end{array} } \right. $$
(19)

where \(FADs = 0.2\) is the probability that the FADs effect influences the optimization process, \(r\) is a uniform random number in [0, 1], and \(r1\) and \(r2\) are random indices into the prey matrix. \(\vec{X}_{\max }\) and \(\vec{X}_{\min }\) are the upper and lower bounds of each dimension, respectively. \(Ma\) is a binary vector containing only 0 and 1: a random vector in [0, 1] is generated first, and entries smaller than 0.2 are set to 0, while the others are set to 1.
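A sketch of the FADs effect in Eq. (19) is given below; the per-individual loop and the bound vectors x_min / x_max are assumptions about how the update is applied.

```python
import numpy as np

def apply_fads(prey, x_min, x_max, CF, fads=0.2, rng=np.random.default_rng()):
    """FADs effect of Eq. (19): occasional long jumps to escape local optima.

    prey:  (n, d) prey matrix; x_min / x_max: per-dimension bounds (assumed arrays)
    """
    n, d = prey.shape
    out = prey.copy()
    for i in range(n):
        r = rng.random()
        if r <= fads:
            Ma = (rng.random(d) >= 0.2).astype(float)      # binary mask, mostly ones
            out[i] = prey[i] + CF * (x_min + rng.random(d) * (x_max - x_min)) * Ma
        else:
            r1, r2 = rng.integers(n, size=2)
            out[i] = prey[i] + (fads * (1 - r) + r) * (prey[r1] - prey[r2])
    return out
```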

The pseudo-code of LsOA is given in figure a.

Note: The value of each dimension is within the range of [0, 1]. If the value is greater than 0.5, it is regarded as 1; otherwise, it is regarded as 0. If the value in a certain dimension is 1, it means that the corresponding feature vector in this dimension is selected, otherwise, it will not be selected. The calculation formula of the fitness value is \(fitness = GCN\left( {G;X} \right)\), \(G\) is the city network graph, \(X\) is the feature matrix, and the output result is the value of Root Mean Squared Error (RMSE).
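For feature selection, each lioness position can be binarised and scored exactly as the note describes. In the sketch below, gcn_rmse is a hypothetical placeholder for a routine that trains the GCN on the selected feature columns and returns the validation RMSE; the guard against an empty feature mask is our addition.

```python
import numpy as np

def binarize(position):
    """Map a continuous position in [0, 1]^d to a feature mask (>0.5 -> selected)."""
    return position > 0.5

def fitness(position, G, X_full, gcn_rmse):
    """Fitness of one lioness: RMSE of the GCN trained on the selected features.

    gcn_rmse is a hypothetical callable implementing fitness = GCN(G; X) that
    trains/evaluates the model and returns the RMSE.
    """
    mask = binarize(position)
    if not mask.any():                       # guard: at least one feature must be kept
        return np.inf
    return gcn_rmse(G, X_full[..., mask])    # keep only the selected keyword columns
```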

2.4.4 Capture spatial information

At the beginning of the outbreak, during the Spring Festival travel season, most people took chances and their awareness of epidemic prevention was weak; the large-scale population movement caused the epidemic to spread quickly across the country in a short time. Since COVID-19 is transmitted from person to person, forecasting the number of confirmed COVID-19 cases in a region requires considering not only the number of confirmed cases in that region over the past period, but also the inflow of confirmed cases from surrounding regions. How to characterize these complex spatial interrelationships is therefore another vital issue in epidemic prediction.

The conventional convolutional neural network (CNN) can capture local spatial information, but it only applies to Euclidean data, such as images and regular grids, where spatial relationship features are obtained through the convolution kernel. A typical two-dimensional CNN convolution is shown in Fig. 5. However, the city network is not a two-dimensional grid but a graph, which means that a CNN cannot reflect the complex topology of the city network and cannot precisely obtain its spatial information. To extend convolution to topological graphs with non-Euclidean structure, the graph convolutional network (GCN) was developed. It can extract features from many irregular data structures found in real life, typically graph (topological) structures.

Fig. 5 A schematic diagram of common CNN two-dimensional convolution

Given a set of graph data containing N nodes, each with its own D-dimensional features, we construct a feature matrix \(X \in R^{N \times D}\) from the node features and an adjacency matrix \(A \in R^{N \times N}\) from the relationships between nodes. The feature matrix \(X\) and the adjacency matrix \(A\) are then input into the network model. The GCN is itself a neural network layer, and its layer-to-layer propagation rule is as follows:

$$ X^{(l + 1)} = \sigma \left( {\tilde{D}^{{ - \frac{1}{2}}} \tilde{A}\tilde{D}^{{ - \frac{1}{2}}} X^{(l)} W^{(l)} } \right) $$
(20)

where \(X^{(l)}\) represents the feature matrix of the \(l\)-th layer and \(W^{(l)}\) represents the learnable parameters of the \(l\)-th layer. \(\tilde{A} = A + I_{N}\), where \(A\) is the adjacency matrix, which reflects the network topology of the graph data, and \(I_{N}\) is the identity matrix, which allows each node to retain its own features. \(\tilde{D}\), with \(\tilde{D}_{ii} = \sum\nolimits_{j} {\tilde{A}_{ij} }\), is the degree matrix, which is used to measure the importance of neighbor nodes. \(\sigma\) is a nonlinear activation function.

It can be seen from the above formula that the GCN model is equivalent to a feature extractor that acts on the nodes of the graph and aggregates each node's first-order neighbors with different weights, thereby capturing the spatial features between nodes. The specific operation is shown in Fig. 6. Suppose the orange node is the central node and the blue nodes are its first-order neighbors; according to the network topology, the central node aggregates the features of itself and its first-order neighbors, and thus obtains spatial information from the feature data.

Fig. 6 Capture spatial features by obtaining topological relationships between nodes

The LsOA-GCN model used in this paper constructed a GCN model by superimposing two convolutional layers. A two-layer GCN model can be expressed as:

$$ X^{{\left( {l + 2} \right)}} = Relu\left( {\hat{A}Relu\left( {\hat{A}X^{\left( l \right)} W^{\left( l \right)} } \right)W^{{\left( {l + 1} \right)}} } \right) $$
(21)

where \(X^{\left( l \right)}\) represents the feature matrix of the \(l\)-th layer; \({\text{Relu()}}\) is the activation function; \(\hat{A} = \tilde{D}^{{ - \frac{1}{2}}} \tilde{A}\tilde{D}^{{ - \frac{1}{2}}}\), \(\tilde{A} = A + I_{N}\), \(A\) is the adjacency matrix, and \(\tilde{D}\) (with \(\tilde{D}_{ii} = \sum\nolimits_{j} {\tilde{A}_{ij} }\)) is the degree matrix; \(W^{\left( l \right)}\) and \(W^{{\left( {l + 1} \right)}}\) represent the learnable parameters of layer \(l\) and layer \(l + 1\), respectively.
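A minimal PyTorch sketch of the two-layer propagation rule in Eq. (21) is shown below. The symmetric normalisation, added self-loops, and outer ReLU follow the equation; the class name, the absence of bias terms, and the hidden size are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """Two-layer GCN of Eq. (21): X_out = ReLU(A_hat ReLU(A_hat X W0) W1)."""

    def __init__(self, adj, in_dim, hidden_dim, out_dim):
        super().__init__()
        # adj is assumed to be a dense (n_nodes, n_nodes) torch tensor
        A_tilde = adj + torch.eye(adj.shape[0])                 # A~ = A + I_N
        d_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))   # D~^(-1/2)
        self.register_buffer("A_hat", d_inv_sqrt @ A_tilde @ d_inv_sqrt)
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W1 = nn.Linear(hidden_dim, out_dim, bias=False)

    def forward(self, x):                  # x: (n_nodes, in_dim) node features
        h = torch.relu(self.A_hat @ self.W0(x))
        return torch.relu(self.A_hat @ self.W1(h))
```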

2.4.5 Loss function

The loss function is the most basic and critical element of the prediction model in the training process. Its purpose is to minimize the error between the actual value and the predicted value of cumulative confirmed cases. The loss function we defined in the LsOA-GCN model was the mean square loss function. The specific formula is as follows:

$$ loss = \frac{{\sum\nolimits_{t = 1}^{T} {(Y_{t} - \hat{Y}_{t} )^{2} } }}{T} $$
(22)

where \(Y_{t}\) and \(\hat{Y}_{t}\) represent the actual and predicted cumulative numbers of confirmed cases on day \(t\), respectively.

2.4.6 LsOA-GCN model

In order to capture the temporal and spatial information of the feature data simultaneously, we used a graph convolutional network prediction model based on the lioness optimization algorithm (LsOA-GCN). The overall process of LsOA-GCN is shown in Fig. 7. The right part is the flow chart of this prediction model, and the left part is the feature screening process.

Fig. 7 LsOA-GCN model structure

In short, the LsOA-GCN model can handle complex temporal dependence and spatial correlation. On the one hand, Spearman correlation analysis with delay days and LsOA are used to capture the dynamic changes of the feature information and obtain the temporal features. On the other hand, the graph convolutional network captures the structure of the city network to obtain the spatial information and finally realize the prediction task.

3 Experiments

3.1 Data set

To verify the performance of the LsOA-GCN model, we predicted the number of confirmed cases of the epidemic for 66 days in 17 regions of Hubei Province. Due to proper epidemic control measures, the number of newly confirmed cases in Hubei Province was almost zero in the later period. Therefore, the prediction target of the confirmed epidemic data set in Hubei Province was set as the cumulative number of daily confirmed cases.

In the experiment, we selected 17 regions in Hubei Province as the primary research areas, namely Wuhan, Ezhou, Enshi, Huanggang, Huangshi, Jingmen, Jingzhou, Qianjiang, Shennongjia, Shiyan, Suizhou, Tianmen, Xiantao, Xianning, Xiangyang, Xiaogan and Yichang. The experimental data were mainly composed of three parts. The first was the migration data from January 10, 2020 to March 15, 2020; the 17*17 adjacency matrix was constructed mainly from the correlation between the migration scale index and the cumulative number of confirmed cases. The second was the keyword data from January 10, 2020 to March 15, 2020 (the date range varies according to the number of delay days), which was used to construct a 17*66*44 feature matrix. The third was the epidemic data from January 24, 2020 to March 29, 2020; the daily cumulative number of confirmed cases in each region of Hubei Province was used as the final target value, forming a 17*66 target value matrix. All experimental data were normalized, and the data set was divided into a training set and a test set at a ratio of 9:1. The length of the historical time series was set to 1, and the cumulative number of confirmed cases on the next day was predicted. The specific division is shown in Fig. 8.

Fig. 8 Division of data set

3.2 Evaluation indicator

At present, there are many indicators for evaluating regression prediction models (Zhao et al. 2020b). RMSE, mean absolute error (MAE), mean absolute percentage error (MAPE), accuracy (Acc), the coefficient of determination (R2), and the explained variance score (Var) were selected as evaluation indicators in this paper to evaluate the difference between the real value \(Y_{t}\) and the predicted value \(\hat{Y}_{t}\) of the cumulative number of confirmed cases. RMSE, MAE, and MAPE mainly measure the difference between the predicted and real values: the smaller these three indicators are, the smaller the difference and the better the performance of the prediction model. Accuracy mainly tests the prediction accuracy of the model, while R2 and Var measure how well the predicted values fit the actual values: the larger these three indicators (Acc, R2, and Var) are, the better the prediction effect. The specific calculation formulas of the six indicators are as follows:


(1) RMSE:

$$ RMSE = \sqrt {\frac{1}{T}\mathop \sum \limits_{t = 1}^{T} \left( {Y_{t} - \hat{Y}_{t} } \right)^{2} } $$
(23)

(2) MAE:

$$ MAE = \frac{1}{T}\mathop \sum \limits_{t = 1}^{T} \left| {Y_{t} - \hat{Y}_{t} } \right| $$
(24)

(3) MAPE:

$$ MAPE = \frac{100\% }{T} \times \sum\limits_{t = 1}^{T} {|\frac{{Y_{t} - \hat{Y}_{t} }}{{Y_{t} }}} | $$
(25)

(4) Acc:

$$ Accuracy = 1 - \frac{{\parallel Y - \hat{Y}\parallel_{F} }}{{\parallel Y\parallel_{F} }} $$
(26)

(5) R2:

$$ R^{2} = 1 - \frac{{\mathop \sum \nolimits_{t = 1}^{T} \left( {Y_{t} - \hat{Y}_{t} } \right)^{2} }}{{\mathop \sum \nolimits_{t = 1}^{T} \left( {Y_{t} - \overline{Y}} \right)^{2} }} $$
(27)

(6) Var:

$$ {\text{var}} = 1 - \frac{{Var\{ Y - \hat{Y}\} }}{{Var\{ Y\} }} $$
(28)

where \(T\) is the number of predicted days; \(\parallel \cdot \parallel_{F}\) is the Frobenius norm (also referred to as the F-norm), a matrix norm that can be used to compare the error between the real matrix and the estimated matrix; and \(Var\{ \cdot \}\) denotes the variance.
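The six indicators of Eqs. (23)–(28) can be computed directly with NumPy, as in the sketch below; it assumes \(Y\) contains no zeros (needed for MAPE) and treats the inputs as plain arrays.

```python
import numpy as np

def evaluate(Y, Y_hat):
    """Compute the six evaluation indicators of Eqs. (23)-(28)."""
    Y, Y_hat = np.asarray(Y, float), np.asarray(Y_hat, float)
    err = Y - Y_hat
    rmse = np.sqrt(np.mean(err ** 2))                      # Eq. (23)
    mae = np.mean(np.abs(err))                             # Eq. (24)
    mape = 100.0 * np.mean(np.abs(err / Y))                # Eq. (25), Y must be nonzero
    acc = 1 - np.linalg.norm(err) / np.linalg.norm(Y)      # Eq. (26), Frobenius norm
    r2 = 1 - np.sum(err ** 2) / np.sum((Y - Y.mean()) ** 2)  # Eq. (27)
    var = 1 - np.var(err) / np.var(Y)                      # Eq. (28)
    return dict(RMSE=rmse, MAE=mae, MAPE=mape, Acc=acc, R2=r2, Var=var)
```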

3.3 Correlation analysis between Baidu Index and the cumulative number of confirmed cases of COVID-19

Since the Baidu Index includes a search index and an information index, the 40 keywords constitute 80 feature vectors. The search index reflects the degree of Internet users' attention to a keyword and its continuous changes; it takes keywords as statistical objects and is calculated by weighting the search frequency of each keyword on Baidu, based on Internet users' search volume. The information index reflects the degree of attention to and coverage of specific keywords in news information on the Internet and their continuous changes; it is based on Baidu's content distribution and recommendation data and is obtained by a weighted summation of the numbers of reads, comments, shares, likes, dislikes, and other user behaviors. We first preprocessed the data and deleted the search index or information index of keywords whose values were all zero, leaving only 61 feature vectors.

Figure 9 shows the trend comparison between the five randomly selected feature vectors and the cumulative number of confirmed cases, from January 10, 2020, to March 15, 2020. The five feature vectors are coronavirus (S), dry cough (S), fever (S), diarrhea (X), and hand washing (S), where (S) represents the search index of the keyword and (X) describes the information index of the keyword. Since the ordinate used a logarithmic scale with a base of 10 and negative or zero values could not be drawn correctly on the logarithmic graph, keywords with the Baidu Index value of 0 would appear truncated and had no value in the graph.

Fig. 9 Trend comparison between the five randomly selected feature vectors and cumulative number of confirmed cases

As can be seen from Fig. 9, the changes in the Baidu Index values of some keywords precede the changes in the cumulative number of confirmed COVID-19 cases. This shows that, based on certain keywords, the future epidemic trend can be predicted to a certain extent, and that there is an apparent lag between the Baidu Index of keywords and the cumulative number of confirmed cases. Because of this, we assumed that the Baidu Index over a period before diagnosis could be used as a predictive indicator, i.e., as the feature vector, from which the number of confirmed cases could then be predicted.

Based on the above analysis, this part mainly analyzes the correlation between the daily cumulative number of confirmed cases and the Baidu Index. COVID-19 has an incubation period, and there are even asymptomatic infections; moreover, as the main feature vector of the prediction model, the Baidu Index needs to lead by at least a certain number of days to have practical application value. Therefore, the Baidu Index values of keywords 0 to 14 days in advance (i.e., with the epidemic data lagging by 0 to 14 days) were successively used for correlation analysis with the epidemic data.

Taking Wuhan as an example, we performed Spearman correlation analysis between the Baidu Index (search index and information index) of each keyword with a time delay of 0–14 days and the cumulative number of confirmed cases in Wuhan. The results are shown in Table 1, and the correlation analysis results of the other 16 regions are shown in Appendix Tables 13–28.
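The following is a minimal sketch of this delay analysis for one keyword series: the Baidu Index is shifted by 0–14 days and Spearman's rho is computed against the cumulative confirmed cases. The series and column names in the usage comment are hypothetical.

```python
import pandas as pd
from scipy.stats import spearmanr

def lagged_spearman(keyword_series: pd.Series, cases_series: pd.Series, max_lag: int = 14):
    """Spearman correlation between a Baidu Index series taken `lag` days in advance
    (i.e., shifted forward by lag) and the cumulative confirmed cases on the same dates."""
    results = {}
    for lag in range(max_lag + 1):
        shifted = keyword_series.shift(lag)              # value observed `lag` days earlier
        valid = shifted.notna() & cases_series.notna()
        rho, _ = spearmanr(shifted[valid], cases_series[valid])
        results[lag] = rho
    best_lag = max(results, key=lambda k: abs(results[k]))  # delay with the strongest correlation
    return results, best_lag

# e.g. results, best_lag = lagged_spearman(df["fever_S"], df["cum_confirmed_wuhan"])
```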

Table 1 Correlation analysis between keywords and the cumulative number of confirmed cases–Wuhan

Table 1 shows the correlation coefficient between each keyword, at delays of 0–14 days, and the cumulative number of confirmed cases; the coefficient corresponding to the optimal delay is shown in bold. For example, for novel coronary pneumonia (S), the 14-day delay has the strongest explanatory power, while for fever (S) the strongest delay is 3 days. In this experiment, we considered it most appropriate to use the keyword data corresponding to the delay with the strongest explanatory power and used it for the subsequent prediction tasks. To highlight the role of the feature vectors and improve prediction performance, after jointly considering the 17 regions of Hubei Province, we retained only the keywords whose maximum correlation coefficient exceeded 0.5, 44 in total, as listed in Table 2. The keyword data were then shifted by the delay corresponding to the maximum correlation coefficient, filled, and sorted to obtain the final feature vectors.
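A minimal sketch of this shift-and-fill step is given below. The DataFrame layout, the column names, and the example delay dictionary are illustrative rather than the paper's actual data structures.

```python
import pandas as pd

def align_features_by_lag(features: pd.DataFrame, optimal_lags: dict) -> pd.DataFrame:
    """Shift each keyword column by its optimal delay (in days) so that the value observed
    `lag` days before a given date becomes that date's feature, then fill the gaps
    introduced at the series boundaries by simple forward/backward filling."""
    aligned = pd.DataFrame(index=features.index)
    for col, lag in optimal_lags.items():
        aligned[col] = features[col].shift(lag)
    return aligned.ffill().bfill()

# e.g. aligned = align_features_by_lag(baidu_index_df,
#                                      {"fever_S": 3, "novel_coronary_pneumonia_S": 14})
```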

Table 2 The final keyword list of COVID-19

After screening out the 44 feature vectors, in order to further verify the importance of the delay-based correlation analysis for epidemic prediction, we fed the sorted and the unsorted feature data separately into the GCN model. The prediction results are shown in Tables 3 and 4, where Table 3 compares the overall prediction results for Hubei Province and Table 4 compares the prediction results of the 17 regions of Hubei Province. It is worth noting that, owing to effective epidemic control measures, the number of newly confirmed cases in most regions of Hubei Province was almost zero in the later period, so the daily cumulative number of confirmed cases was nearly unchanged. Therefore, according to the calculation formulas of the evaluation indicators, when comparing the regional prediction results we used only the four indicators RMSE, MAE, MAPE, and Acc. Tables 3 and 4 clearly show that the sorted feature data outperform the unsorted feature data on all indicators, which confirms that sorting the feature data by the delays obtained from the correlation analysis improves the performance of the prediction model.

Table 3 Comparison of prediction results in Hubei Province
Table 4 Comparison of prediction results of 17 regions in Hubei Province

3.4 Performance analysis of lioness optimization algorithm on benchmark functions

After determining the optimal delay days for each feature, we considered using the lioness optimization algorithm to solve the feature selection problem, whose main task is to delete redundant and irrelevant features from the original feature data. The lioness optimization algorithm was introduced in Sect. 2.4.3; in this part, we mainly test its performance.

In this experiment, we tested the algorithm on 23 commonly used benchmark functions. The benchmark functions fall into three categories: (1) unimodal functions (F1–F7), which have a unique optimal solution and are mainly used to test the exploitation performance of an algorithm; (2) multimodal functions (F8–F13), which have multiple local optima and are mainly used to test the exploration ability of an algorithm; and (3) fixed-dimension multimodal functions (F14–F23), which have low, fixed dimensions. The optimization algorithms involved in the comparison were the classical particle swarm optimization (PSO), grey wolf optimizer (GWO) (Mirjalili et al. 2014), whale optimization algorithm (WOA), and teaching–learning-based optimization (TLBO), together with the newly developed heap-based optimizer (HBO), marine predators algorithm (MPA), and K-means clustering algorithm (KO). The parameter settings of these algorithms and further details can be found in the corresponding references. To provide reliable comparison results, every algorithm was run independently 30 times in the same environment on each test function, with 15,000 function evaluations per run; under otherwise identical conditions, the differences between algorithms therefore reflect differences in performance. Table 5 shows the test results of the different algorithms on the benchmark functions.
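The sketch below illustrates this benchmark protocol (30 independent runs, 15,000 function evaluations each) on the unimodal sphere function F1. The paper's LsOA implementation is not reproduced here; the trivial random-search baseline is only a runnable stand-in for whatever optimizer (LsOA, PSO, GWO, ...) is plugged in, and the function names are ours.

```python
import numpy as np

def sphere(x):
    """F1: unimodal sphere function, global minimum 0 at the origin."""
    return np.sum(x ** 2)

def run_benchmark(optimizer, func, dim=30, bounds=(-100.0, 100.0),
                  n_runs=30, max_evals=15_000, seed=0):
    """Repeat an optimizer on one benchmark function and report mean/std of the best values."""
    rng = np.random.default_rng(seed)
    best_values = [optimizer(func, dim, bounds, max_evals, rng) for _ in range(n_runs)]
    return np.mean(best_values), np.std(best_values)

def random_search(func, dim, bounds, max_evals, rng):
    """Trivial baseline used only to make the harness runnable; replace with LsOA, PSO, etc."""
    lo, hi = bounds
    samples = rng.uniform(lo, hi, size=(max_evals, dim))
    return min(func(x) for x in samples)

# mean_f, std_f = run_benchmark(random_search, sphere)
```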

Table 5 Test results of different algorithms on common benchmark functions

It can be seen from Table 5 that, for the unimodal functions (F1–F7), the average values obtained by LsOA on F5 and F6 were far superior to those of the other algorithms, and on F7 LsOA also ranked first by a slight margin. On F1–F4, LsOA performed worse than KO but was clearly superior to the remaining algorithms, showing strong competitiveness. For the multimodal functions (F8–F23), LsOA reached the global optimum on F9, F11, and F15–F23. On F8, LsOA held a slight advantage over the other algorithms and achieved good results. On F10 and F14, LsOA tied for first with some algorithms, and on F12 and F13 it showed a clear advantage.

Considering the test results on these benchmark functions together, the last row of Table 5 gives the final ranking. The overall performance of LsOA on the 23 commonly used benchmark functions was excellent: with an average rank of 1.17 it ranked first overall, ahead of KO, MPA, TLBO, HBO, GWO, WOA, and PSO in turn. These experiments demonstrate the superior performance of LsOA and show that the algorithm can be used to screen the prediction features for COVID-19.

3.5 Setting training parameters and comparison methods

3.5.1 Training parameters

The parameters used to train the GCN model included the lengths of the historical and future time series, the feature dimension, the learning rate, the number of training epochs, the batch size, the number of hidden layer nodes, the optimizer, the loss function, and so on. The experimental hardware was a server with an Intel(R) Pentium(R) CPU N3700 @ 1.60 GHz and 4 GB of memory, and the software environment consisted of Python 3.6.12 and TensorFlow 1.14.0.

Among the training parameters, the number of hidden layer nodes is very important in the GCN model: too many nodes cause over-fitting, whereas too few make the model inaccurate, so different numbers of hidden layer nodes greatly affect the prediction accuracy. Therefore, in this part we conducted a sensitivity analysis, running experiments with different numbers of hidden layer nodes, analyzing the model's performance, and finally determining the optimal number by comparing the prediction results.

In this experiment, we selected the number of hidden layer nodes from {8, 16, 32, 64, 80, 100, 128, 144, 160}, ran the experiments with the number of training epochs fixed at 3000, and analyzed the change in prediction accuracy. The results of the sensitivity analysis are shown in Fig. 10, where the abscissa is the number of hidden layer nodes and the ordinate is the value of each indicator.
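A minimal sketch of this sweep is given below. The callable `train_and_evaluate` is a hypothetical wrapper around the GCN training routine (it should map a hidden-layer size to a dictionary of the evaluation indicators); only the candidate sizes and the 3000-epoch setting come from the paper.

```python
def sweep_hidden_nodes(train_and_evaluate,
                       hidden_sizes=(8, 16, 32, 64, 80, 100, 128, 144, 160)):
    """Train and evaluate the model once per hidden-layer size (at 3000 epochs in the paper)
    and pick the size with the lowest RMSE. `train_and_evaluate` is a user-supplied callable."""
    results = {n: train_and_evaluate(n) for n in hidden_sizes}
    best = min(results, key=lambda n: results[n]["RMSE"])   # 16 in the paper's experiments
    return best, results
```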

Fig. 10
figure 10

Sensitivity analysis of hidden layer nodes

Figure 10 shows how the indicator values change with the number of hidden layer nodes when the number of training epochs is 3000. Figure 10a and c show the changes of RMSE, MAE, and MAPE as the number of hidden layer nodes increases: RMSE and MAE are smallest with 16 hidden layer nodes, and MAPE is smallest with 32, indicating that the difference between the predicted and actual values is smallest in those settings. Figure 10b shows the changes of Accuracy and Var; both reach their maximum with 16 hidden layer nodes, indicating that the model fit is best at that point. In summary, when the number of training epochs is set to 3000, the model performs best with 16 hidden layer nodes.

These figures also show that, as the number of hidden layer nodes increases, the prediction accuracy first decreases, then increases, and then decreases again. This is mainly because, once the number of hidden layer nodes exceeds a certain level, the model complexity and computational difficulty increase greatly, thereby reducing the prediction accuracy. According to the results of the sensitivity analysis, we therefore also set the number of hidden layer nodes to 16 in the subsequent LsOA-GCN experiments. The parameters finally used to train the LsOA-GCN model are shown in Table 6.

Table 6 Parameters of training model

3.5.2 Comparison methods

This part compares several representative time series prediction methods used in current epidemic prediction research, including multiple linear regression, support vector regression, backpropagation neural network, kernel ridge regression, Gaussian process regression, stochastic gradient descent regression, and the GAT, LSTM, and T-GCN models. The nine comparison methods are described as follows (a configuration sketch for several of them follows the list):

  • (1) Multiple Linear Regression (Yuchi et al. 2019). Regression with two or more linearly correlated independent variables is called multiple linear regression (MLR). The method uses an optimal combination of several independent variables to jointly predict or estimate the dependent variable. Because it integrates the relationships among multiple independent variables, it is usually more effective and more realistic than simple linear regression, which uses only one independent variable. In the comparison experiment, all parameters of this method were left at their default values.

  • (2) Support Vector Regression (Brereton and Lloyd 2010). The support vector machine (SVM) is a class of generalized linear classifiers that performs binary classification in a supervised manner, and support vector regression (SVR) is its application to regression problems. SVR has many variants based on different loss functions; the core idea is to find a separating hyperplane (hypersurface) that minimizes the expected risk. In the comparison experiment, the kernel type was 'poly', the penalty parameter was C = 1.1, the kernel coefficient gamma of 'poly' was set to 'auto', the degree of the polynomial kernel was 3, epsilon = 0.1, the independent term in the kernel function was coef0 = 1.0, and the remaining parameters used default values.

  • (3) BP neural network (Wang et al. 2015). The backpropagation neural network (BPNN) is a multilayer feedforward network trained by error backpropagation. Its basic idea is gradient descent: gradient search is used to minimize the mean squared error between the real and expected outputs of the network. It is the most widely used neural network. The BPNN in the comparison experiment had a three-layer structure with 16, 32, and 1 nodes in the input, hidden, and output layers, respectively. The activation function was ReLU, the loss was the mean squared error, the optimizer was Adam, the number of training epochs was 3000, and the batch size was 7.

  • (4) Kernel Ridge Regression (Wu and Zhao 2020). Kernel ridge regression (KRR) introduces the kernel method into ridge regression: the independent-variable space is mapped into a high-dimensional feature space through a kernel function, and ridge regression is then applied in that feature space. In other words, KRR is a regression algorithm whose essence is ridge regression. Ridge regression is linear and fits a straight line; when the relationship is nonlinear, a kernel function maps the independent variables into a high-dimensional feature space, where linear regression is performed. For the KRR model in the comparison experiment, alpha = 1.0, the kernel was linear, the regularization strength coefficient was set to 0.9, and the remaining parameters were default values.

  • (5) Gaussian Process Regression (Arthur et al. 2020). The Gaussian process (GP) is a commonly used supervised learning method for regression and probabilistic classification problems. Gaussian process regression (GPR) applies the GP model to regression: it is a nonparametric model that places a Gaussian process prior on the data. In the comparison experiment, the kernel object was kernel = DotProduct() + WhiteKernel(), the random generator was set with random_state = 0, and the remaining parameters used default values.

  • (6) Stochastic Gradient Descent Regression (Ighalo et al. 2020). Stochastic gradient descent (SGD) is a simple but highly efficient method mainly used for the discriminative learning of linear classifiers under convex loss functions, such as (linear) support vector machines and logistic regression. SGD regression supports different loss functions and penalties for fitting linear regression models. For the SGD model in the comparison experiment, the maximum number of iterations was set to 1000, the stopping criterion tol was set to 0.001, and the remaining parameters used default values.

  • (7) GAT (Velikovi et al. 2017). The graph attention network (GAT) aggregates neighboring nodes through an attention mechanism and adaptively assigns different weights to different neighbors. This differs from the GCN, in which the neighbor weights are fixed and come from the normalized Laplacian matrix; the attention mechanism greatly improves the expressive ability of graph neural network models. The GAT model in the comparison experiment was a two-headed, three-layer network with 45, 16, and 1 nodes in the three layers, respectively. The activation function was LeakyReLU, the loss was the mean squared error, the optimizer was Adam with the default learning rate of 0.001, the number of training epochs was 3000, and the batch size was 7.

  • (8) LSTM (Hochreiter and Schmidhuber 1997). Long short-term memory (LSTM) is a recurrent neural network designed to solve the long-term dependency problem of the general recurrent neural network (RNN). Its core concepts are the cell state and the "gate" structures. The cell state acts as the path along which information is transmitted, allowing information to be passed along the sequence; it can be regarded as the network's "memory". Information is added or removed through the gate structures, which learn during training which information should be kept or forgotten. The LSTM model in the comparison experiment had a three-layer structure with 45, 200, and 1 nodes in the three layers, respectively. The optimizer was Adam, the initial learning rate was 0.005, the gradient threshold was set to 1, the learning rate was multiplied by 0.1 after 125 cycles, the number of training epochs was 3000, and the batch size was 7.

  • (9) T-GCN (Zhao et al. 2020b). The temporal graph convolutional network (T-GCN) was proposed for traffic flow prediction and consists of two parts: a graph convolutional network (GCN) that learns the connectivity of roads in the traffic network graph to model the spatial correlation of spatio-temporal traffic data, and a gated recurrent unit (GRU) that captures the temporal dependence of the data. The T-GCN model in the comparison experiment was composed of a GCN and a GRU; the optimizer was Adam, the initial learning rate was 0.001, the number of hidden layer nodes was 8, the activation function of the graph convolutional layer was sigmoid, the activation function of the GRU layer was tanh, the loss function was \(loss = ||Y_{t} - \hat{Y}_{t} || + \lambda L_{reg}\), the number of training epochs was 3000, and the batch size was 7.
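As a hedged configuration sketch, the shallow baselines above could be instantiated as follows with the reported hyperparameters. It is an assumption that scikit-learn was the underlying library (the paper does not name it), and any unspecified parameters are left at the library defaults.

```python
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

baselines = {
    # (1) MLR with default parameters
    "MLR": LinearRegression(),
    # (2) SVR: degree-3 polynomial kernel, C=1.1, gamma='auto', epsilon=0.1, coef0=1.0
    "SVR": SVR(kernel="poly", C=1.1, gamma="auto", degree=3, epsilon=0.1, coef0=1.0),
    # (4) KRR: linear kernel, alpha=1.0 (the separately reported 0.9 regularization
    #     strength has no obvious scikit-learn counterpart and is omitted here)
    "KRR": KernelRidge(alpha=1.0, kernel="linear"),
    # (5) GPR: DotProduct() + WhiteKernel(), random_state=0
    "GPR": GaussianProcessRegressor(kernel=DotProduct() + WhiteKernel(), random_state=0),
    # (6) SGD regression: max_iter=1000, tol=0.001
    "SGD": SGDRegressor(max_iter=1000, tol=1e-3),
}
```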

3.6 Performance evaluation

In this section, to verify the effectiveness of the prediction model, we used as the data source the cumulative number of confirmed COVID-19 cases in the 17 regions of Hubei Province from January 24, 2020, to March 29, 2020 (66 days in total), together with the Baidu Index of the adjusted keywords. The data before March 23, 2020, served as the training set, and the LsOA-GCN model was used to predict the cumulative number of confirmed cases over the 7 days from March 23 to March 29, 2020. During training, the input \(X_{i}\) is the keyword data on the \(i\)-th day, where \(X_{i} = \left[ {x^{1} , x^{2} , ..., x^{j} } \right]\), \(1 \le i \le 66\), \(1 \le j \le 44\), and the output \(Y_{i + 1}\) is the cumulative number of confirmed cases on the (\(i + 1\))-th day. The predictions were then compared with those of 10 representative methods in current epidemic prediction research: multiple linear regression, support vector regression, backpropagation neural network, kernel ridge regression, Gaussian process regression, stochastic gradient descent regression, GAT, LSTM, T-GCN, and the original GCN model. To assess the prediction accuracy of each model, we compared its predicted values with the actual values through the evaluation indicators and obtained the RMSE, MAE, MAPE, Acc, R2, and Var of each model. The final results are shown in Tables 7 and 8, where Table 7 shows the overall prediction results for Hubei Province and Table 8 shows the prediction results for each of the 17 regions.
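The sketch below illustrates how the supervised pairs (keyword features on day \(i\), cumulative cases on day \(i+1\)) and the 7-day hold-out split described above could be constructed; the array shapes and variable names are illustrative assumptions rather than the paper's actual data pipeline.

```python
import numpy as np

def build_supervised_pairs(features, cases, test_days=7):
    """Pair the keyword features on day i with the cumulative confirmed cases on day i+1,
    then hold out the last `test_days` pairs for testing.

    features : array of shape (n_days, n_regions, n_keywords)   (66 x 17 x 44 here)
    cases    : array of shape (n_days, n_regions)
    """
    X = features[:-1]            # features on day i
    Y = cases[1:]                # cumulative confirmed cases on day i + 1
    X_train, Y_train = X[:-test_days], Y[:-test_days]
    X_test, Y_test = X[-test_days:], Y[-test_days:]
    return (X_train, Y_train), (X_test, Y_test)
```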

Table 7 Experimental results of LsOA-GCN and other methods–Hubei Province
Table 8 Experimental results of LsOA-GCN and other methods–17 regions in Hubei

The overall results in Table 7 show that, compared with the other 10 methods, the LsOA-GCN model attains the smallest RMSE, MAE, and MAPE for the predicted cumulative number of confirmed cases, namely 14.65572, 5.4005, and 0.0035, respectively; its predictions are therefore closest to the actual values, with the smallest error. For the three indicators Acc, R2, and Var, the values are all close to 1, indicating that the prediction accuracy of the model is high and that it holds a clear advantage. From the regional results in Table 8, we can see that when predicting the cumulative number of confirmed cases in Xianning, LsOA-GCN ranks only third, by a slight margin, on MAE and MAPE; otherwise, for RMSE, MAE, and MAPE, LsOA-GCN has a clear advantage over the other forecasting methods. For Acc, the values across the 17 regions lie between 0.9914 and 0.9990; the closer Acc is to 1, the better the fit between the actual and predicted values and the higher the prediction accuracy. We also found that the same prediction model applied to the 17 regions of Hubei Province produces results that differ to some extent; for example, with LsOA-GCN the RMSE is 50.6662 for Wuhan, 1.4078 for Ezhou, and 0.2504 for Enshi, showing that the same prediction method performs differently in different regions.

The convergence curve in Fig. 11 also demonstrates the influence of LsOA-based feature screening on the prediction results; the ordinate is the RMSE value and the abscissa is the number of iterations. At the beginning of the iteration the RMSE drops sharply, and as the iteration progresses it gradually decreases and finally stabilizes. The final feature matrix is a 17×66×3 matrix, and its keyword features are dyspnea (S), novel coronavirus pneumonia (S), and stuffy nose (X). The two symptoms dyspnea (S) and stuffy nose (X) are not unique to COVID-19, but they are easy to recognize and attract attention, and for the general population they are indicators of possible COVID-19 infection. In addition, novel coronavirus pneumonia (S) was a commonly used term during the epidemic, indicating that as the epidemic continued to spread, people had a growing need for basic knowledge about COVID-19. Based on the above analysis, we believe that these three keywords can reflect the prevalence of COVID-19 to a certain extent and have predictive value.

Fig. 11
figure 11

Convergence curve

In general, across these six evaluation indicators, the LsOA-GCN prediction model shows superior performance both for Hubei Province as a whole and for its 17 regions, and it is the best prediction method.

3.7 Statistical test

Before the statistical analysis, the Friedman test was used to calculate the average ranking of the prediction methods on the different indicators; the results are shown in Table 9. According to the Friedman mean ranks over the 17 regions of Hubei Province, LsOA-GCN performs best, ranking first on the four indicators RMSE, MAE, MAPE, and Acc. For RMSE, the average rank of LsOA-GCN is 1.00, better than GCN, LSTM, BPNN, SVR, GPR, MLR, KRR, SGD, T-GCN, and GAT by 182%, 324%, 329%, 376%, 482%, 512%, 729%, 800%, 853%, and 912%, respectively; for MAE, the average rank of LsOA-GCN is 1.12, better than these methods by 121%, 284%, 295%, 363%, 405%, 426%, 637%, 721%, 753%, and 800%, respectively; for MAPE, the average rank of LsOA-GCN is 1.12, better than these methods by 121%, 284%, 289%, 363%, 426%, 426%, 632%, 716%, 753%, and 795%, respectively; and for Acc, the average rank of LsOA-GCN is 11.00, better than these methods by 17%, 29%, 29%, 34%, 46%, 47%, 66%, 72%, 78%, and 82%, respectively. These results once again verify the superior performance of the LsOA-GCN prediction method. The percentages are calculated as follows:

$$ percentage = \frac{{rank_{i} - rank_{LsOA - GCN} }}{{rank_{LsOA - GCN} }} $$
(29)

where \(rank_{i}\) represents the ranking of the prediction method \(i\), \(i\) is the prediction method used for comparison, and \(rank_{LsOA - GCN}\) is the ranking of LsOA-GCN.

Table 9 Friedman test ranking results

When comparing methods, it is often necessary to perform statistical tests on the experimental results, so this part applies several statistical tests to the compared methods. First, the nonparametric Friedman test was used to determine whether the performances of all methods differ significantly under each indicator. To make the comparison reliable, we considered the 11 methods and tested them on 4 indicators over the 17 regions: the first group used RMSE, the second MAE, the third MAPE, and the fourth Acc. The Friedman test first requires the average ranks to be calculated; its statistic is then compared with the critical value at the chosen significance level (α = 0.05, 0.1) to decide whether there is evidence against the null hypothesis. For all four indicators, the null hypothesis was rejected, indicating that the performances of the methods in each group differ significantly.
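A minimal sketch of this step for one indicator (e.g., RMSE over the 17 regions) is given below; the data layout and names are illustrative. Note that the default ascending ranking suits error indicators, while for Acc (larger is better) the negated values would be ranked instead.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_test(scores: dict, alpha: float = 0.05):
    """Friedman test over one indicator.
    scores maps each method name to an array of shape (n_regions,), e.g. 17 RMSE values."""
    methods = list(scores)
    data = np.column_stack([scores[m] for m in methods])   # shape (n_regions, n_methods)
    mean_ranks = rankdata(data, axis=1).mean(axis=0)        # Friedman mean rank per method
    stat, p_value = friedmanchisquare(*[data[:, j] for j in range(data.shape[1])])
    return dict(zip(methods, mean_ranks)), stat, p_value, p_value < alpha
```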

After a difference was established, a "post hoc test" was needed to determine which methods differ statistically in performance. For this purpose, we used the Bonferroni–Dunn test (Zar 2010), which compares the proposed method, LsOA-GCN, with the other 10 methods: the difference in average rank between two methods is compared with the critical difference (CD). If the rank difference exceeds the CD, the method with the better average rank is statistically superior to the other; otherwise, there is no statistical difference between the two. The critical difference is calculated as follows:

$$ CD = q_{\alpha } \sqrt {\frac{k(k + 1)}{6N}} $$
(30)

where \(k\) is the number of compared methods, \(N\) is the number of data sets, and the commonly used values of \(q_{\alpha }\) can be obtained from the table in Appendix B.
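As a sketch of Eq. (30): the paper reads \(q_{\alpha}\) from the table in Appendix B, whereas the code below approximates it by the Bonferroni-corrected normal quantile over the \(k-1\) comparisons, which reproduces the commonly tabulated values; the function name is ours.

```python
import math
from scipy.stats import norm

def critical_difference(k: int, n: int, alpha: float = 0.05) -> float:
    """Critical difference of the Bonferroni-Dunn test, Eq. (30).
    k: number of compared methods, n: number of data sets (here 17 regions).
    q_alpha is approximated by the Bonferroni-corrected two-tailed normal quantile;
    the paper instead looks it up in a table (Appendix B)."""
    q_alpha = norm.ppf(1.0 - alpha / (2.0 * (k - 1)))
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# e.g. critical_difference(k=11, n=17, alpha=0.05)
```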

The results of the Bonferroni–Dunn test are shown in Fig. 12. For ease of observation and comparison, only the three indicators RMSE, MAE, and MAPE are used here, and the average ranks of the 11 methods on these 3 indicators are shown as bars. The horizontal lines correspond to thresholds, i.e., the rank of the compared method plus the CD value; two thresholds are defined, at significance levels of 0.05 and 0.1. Each group is drawn in a different color (the first group blue, the second red, the third green), and the thresholds are likewise distinguished by color.

Fig. 12
figure 12

Bonferroni–Dunn test for different methods and benchmark groups with α = 0.05 and α = 0.1

It is worth noting that the reference method (LsOA-GCN) outperforms the methods whose average ranks exceed the threshold line (i.e., whose bars are higher than the corresponding line). As Fig. 12 shows, LsOA-GCN is the best method in all three groups, with average ranks of 1.00, 1.12, and 1.12, respectively. In Group 1, LsOA-GCN is significantly better than the other 7 methods, the exceptions being GCN, LSTM, and BPNN, at significance levels of 0.05 and 0.1. In Groups 2 and 3, LsOA-GCN is likewise significantly better than the 7 methods GAT, T-GCN, GPR, KRR, MLR, SGD, and SVR at these two significance levels.

In addition to the Bonferroni–Dunn test, this study also considered Holm's method (Holm 1979). The procedure is to compute the p values, sort them, and compare each p value with α/i; if p < α/i, the null hypothesis is rejected and the difference is significant, where α is the significance level and i is the method index. Again, only the three indicators RMSE, MAE, and MAPE are considered. The test results for the three groups are shown in Tables 10, 11 and 12, respectively. For all three indicators, LsOA-GCN is significantly better than the other methods at significance levels of 0.05 and 0.1, which once again verifies the superiority of the LsOA-GCN model.
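For reference, the following is a minimal sketch of the standard Holm step-down procedure, in which the p values are sorted in ascending order and the i-th smallest is compared with α/(m − i + 1); this is assumed to correspond to the α/i criterion described above when the indexing runs from the least significant comparison. The function name is ours.

```python
def holm_test(p_values, alpha=0.05):
    """Holm's step-down procedure: sort the m p-values in ascending order, compare the
    i-th smallest with alpha / (m - i + 1), and stop at the first non-rejection.
    Returns, for each original comparison, whether its null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    reject = [False] * m
    for i, j in enumerate(order, start=1):
        if p_values[j] <= alpha / (m - i + 1):
            reject[j] = True
        else:
            break           # all remaining (larger) p-values are not rejected
    return reject

# e.g. holm_test([0.0004, 0.003, 0.02, 0.2], alpha=0.05)
```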

Table 10 Holm’s method test results of the first group
Table 11 Holm’s method test results of the second group
Table 12 Holm’s method test results of the third group

4 Conclusion

Aiming at the spreading characteristics of COVID-19, this paper proposed a graph convolutional network prediction model based on the lioness optimization algorithm (LsOA-GCN). The LsOA-GCN model consists of two main parts: the construction of the feature matrix, which captures temporal information, and the graph convolution operation, which captures spatial information. First, the keyword data were sorted through Spearman correlation analysis with a time lag, and the initial feature matrix was constructed to capture the time-series relationship between the current number of confirmed cases and the Baidu Index at different periods in the past. LsOA was then used to further filter the feature matrix, improving the accuracy of the model and avoiding interference from unnecessary features. Finally, the feature matrix was used as input to the graph convolutional network, which extracts the spatial features of the epidemic-related data and produces the prediction results.

We used the cumulative confirmed case data of the 17 regions of Hubei Province from January 24, 2020, to March 29, 2020, to test the effectiveness of the prediction model, and evaluated it with the indicators RMSE, MAE, MAPE, Accuracy, Var, and R2. The results were compared with 10 typical prediction methods: multiple linear regression, support vector regression, backpropagation neural network, kernel ridge regression, Gaussian process regression, stochastic gradient descent regression, GAT, LSTM, T-GCN, and the original GCN. From the experimental results we draw the following conclusions:

  (1) Time-delay analysis can improve the accuracy of epidemic prediction. By observing the trends of the features against the number of confirmed cases (Fig. 9), it can be seen that the number of confirmed cases lags the search index. The optimal lag for each feature was then determined by correlation analysis (Table 1), and after sorting the feature data accordingly, the prediction results based on the time-delayed features were significantly better than those based on the original features (Tables 3 and 4).

  (2) GCN can capture spatial information from the feature data, which helps the prediction model improve its accuracy. According to the characteristics of epidemic transmission, this paper used the population migration index to simulate population mobility during the epidemic and constructed the adjacency matrix on this basis. GCN then established the spatial connections among the regions of Hubei Province and realized the prediction of COVID-19 in each region. Experiments show that, compared with the other prediction models, the GCN and LsOA-GCN models achieve better results on all indicators (Tables 7 and 8).

  (3) The selection of prediction features also significantly affects the prediction performance of the model. Compared with the plain GCN model, LsOA-GCN performs better on all indicators, showing that feature selection helps to improve the prediction accuracy of the model.

In the future, we will consider applying the LsOA-GCN model to financial market forecasting, traffic forecasting, environmental pollution forecasting, and other tasks, to examine whether the model also performs well in other fields.