Abstract
The outbreak of the novel coronavirus (COVID-19) was declared to be a global emergency in January of 2020, and everyday life throughout the world was disrupted. Among many questions about COVID-19 that remain unanswered, it is of interest for society to identify whether there is any significant difference in daily case counts between males and females. The daily case count sequences are correlated due to the nature of a contagious disease, and contain a nonlinear trend owing to several unexpected events, such as vaccinations and the appearance of the delta variant. It is possible that these unexpected events have changed the dynamical system that generates data. The classic t-test is not appropriate to analyze such correlated data with a nonconstant trend. This study applies a simultaneous confidence band approach in an attempt to overcome these difficulties; that is, a simultaneous confidence band for the trend of an autoregressive moving-average time series is constructed using B-spline estimation. The proposed method is applied to the daily case count data of seniors of both genders (at least 60 years old) in the State of Ohio from April 1, 2020 to March 31, 2022, and the result shows that there is a significant difference at the 95% confidence level between the two gender case counts adjusted for the population sizes.
Similar content being viewed by others
1 Introduction
The outbreak of the coronavirus disease (COVID-19) in late 2019 has profoundly changed all walks of life and has raised many challenging research questions over the development of the pandemic. It is the first such pandemic outbreak since the 1918 influenza, commonly known by the misnomer “Spanish flu”. It is the first time in human history that a huge amount of data related to a pandemic has been recorded, which makes it possible to analyze the biological, medical and social aspects of COVID-19, and which provides statisticians and data scientists with tremendous yet challenging opportunities for COVID-19 studies [1].
There have been some attempts to obtain valuable information for society to help identify risk factors and gain an understanding of the coronavirus, in spite of there being so far a limited amount of publicly available data. The primary goal of the paper is to propose an appropriate statistical learning method, using publicly available COVID-19 data, to obtain information about how the coronavirus affects male and female seniors differently.
Many aspects of this virus have been studied based on publicly accessible data which are updated daily from all over the world [2,3,4,5]. In particular, the effects of gender and age on COVID-19 have received a great deal of attention. Senior people tend to have many comorbidities which are associated with most of the senior deaths [6,7,8,9], and the coronavirus tends to exacerbate preexisting health problems. In addition, gender seems to play a role too, and male tends to be more vulnerable. The paper [10] presented the age and gender effects on the fatality rate using the COVID-19 data of the State of Ohio. The authors applied logistic regression to identify factors on the death probability of a COVID-19 patient and concluded that age and gender were significant factors. One of the critical assumptions for logistic regression is that the log odds ratio can be modeled by a linear function [11]. However, many unexpected events, such as vaccination that began in early 2021 and the delta variant that appeared in June–July 2021, could have changed the data generating mechanism. In other words, the linear assumption of logistic regression is too restrictive.
This study focuses on the senior case count comparison between April 1, 2020 and March 31, 2022, inclusive, and aims at investigating whether the impact of male and female on the case count is significant by modeling the difference of log transformed case counts of the two genders \(\{y_{t}\}_{t=1}^n\) after population size adjustment. The paper [12] proposed a correction in the analysis of variance test for autoregressive moving-average time series (ARMA). However, the time series \(\{y_{t}\}_{t=1}^n\) are not only correlated due to the nature of a contagious disease, but also contain a nonlinear trend due to several unexpected events, such as vaccinations and the appearance of the delta variant. The dots in Fig. 1 are the observations of \(\{y_{t}\}_{t=1}^n\), which do not unveil any pattern of the trend or of the mean function that can be easily described by an analytical function. This observation about the data plotted in Fig. 1 suggests that a nonparametric approach, such as splines, is more appropriate. B-spline estimation has been widely applied to describe an unknown mean function [13,14,15,16]. In particular, [17] proposed a simultaneous confidence band (SCB) approach for the mean function based on B-spline estimation, and [18] extended it to autoregressive time series with unknown nonlinear trend function.
The paper extends the confidence band approach of AR to a more general ARMA setting, and applies it to the COVID-19 senior daily case count comparison of males and females in the State of Ohio. The paper is organized as follows. In Sect. 2, the details of the data set to be analyzed are provided; in Sect. 3, the B-splines trend function estimation is reviewed; in Sect. 4, the construction of confidence bands is described, and some simulation studies are presented; in Sect. 5, the confidence band approach is applied to the senior COVID-19 data; in Sect. 6, the paper concludes with some discussion.
2 Data
The State of Ohio established a websiteFootnote 1 which provides the public with information and data about COVID-19 in the state. The daily case count data of male and female seniors (at least 60 years old) from April 1, 2020 to March 31, 2022, inclusive, were downloaded from the website. There are \(n=730\) daily case counts for each group, and there is no missing data or zero count during this period. According to the summary of the statistics in Table 1, the female case count has larger values in all major statistics, such as mean and median. This case discrepancy might be resulted in by the different of the populations of these two groups—the male senior population is smaller than the female counterpart in Ohio. Define a time series \(\{y_t\}_{t=1}^{n}\) as the differenced log transformed daily case counts of the two groups: male senior COVID-19 patients and female senior COVID-19 patients. The scatter plot in Fig. 1 shows that \(\{y_t\}_{t=1}^{n}\) exhibits some nonlinear pattern, and tends to be negative, or the male senior group seems to have a lower case count than the female counterpart.
The population difference is taken into account and calculated according to the information in the State of Ohio in 2019–2020.Footnote 2 The horizontal dashed line in Fig. 1 is \(y=-\,0.227\), which is the differenced log male and female senior populations. Most of the observations \(\{y_t\}_{t=1}^{n}\) are above the line \(y=-\,0.227\), which suggests that the male senior group has a higher case rate or are more vulnerable to COVID-19. However, in Fig. 1, the case count of the male senior group outnumbered the case count of the female senior group at the first peak which occurred in the beginning of the pandemic on April 16, 2020, but the opposite situation happened at the second large peak more than a year later on June 19, 2021. Therefore, it is not obvious which gender group is more vulnerable, and the confidence band approach proposed in this paper will provide rigorous statistical evidence to this end.
3 B-spline Trend Estimation
The proposed confidence band is constructed based on a B-spline estimator for the mean function or trend g. B-spline estimation for functions is one of the nonparametric methods that are widely applied. We start with the B-spline estimation and provide the details about how to construct a \((1-\alpha )\%\) simultaneous confidence band.
A realization \(\{y_t\}_{t=1}^{n}\) of a time series is assumed to follow the model below:
where \(g(u_t)\) is a smooth trend function with \(u_t=t/n\) and \(\{x_t\}\) is an autoregressive and moving-average time series with orders p and q (ARMA(p, q)) with mean \(\textsf{E}(x_t)=0\) and covariance \(\gamma _k=\textsf{Cov}(x_t, x_{t-k})\)
In the ARMA model (2) above, \(\{\epsilon _t\}\) is a white noise sequence; that is, mean \(\textsf{E}(\epsilon _t)=0\) and variance \(\textsf{Cov}(\epsilon _t, \epsilon _{t-k})=\sigma ^2\) if \(k=0\) and 0 otherwise. Hereafter, bold letters are used for vectors. The default of a vector is column, and a superscript T is used sometimes for the transpose operation of a matrix. For example, \(\varvec{y}=(y_{1}, \ldots , y_{n})^{T}\) is a vector that is corresponding to all the observations.
This paper considers the case that the unknown trend function g is estimated by degree one B-splines or piecewise linear B-splines for two reasons: a higher degree B-spline estimator does not provide much more gain; a rigorous proof for a confidence band based on the piecewise linear B-splines can be obtained. Let N be an integer, and let \([-h, 1]\) be divided into \((N+2)\) equally spaced subintervals \(J_j=\left[ jh, (j+1)h\right) , j=-1, 0, \ldots , N-1\) and \(J_N=\left[ Nh, 1\right]\), of width \(h=(N+1)^{-1}\). The integer N is a tuning parameter.
The basis of a piecewise linear spline space is \(\varvec{b}(u)=(b_{-1}(u), \ldots , b_{N}(u))^T\) with \(b_{j}(u)\) being defined as follows:
where j(u) is the location index of u for \(u\in J_{j(u)}\). Any \(b_{-1}(u)\) is not zero in at most two adjacent subintervals. Figure 2 is an example of the degree one B-splines \(\{b_{j,1}(u)\}_{j=-1}^3\), where \(h=0.25\) and \(N=3\). The standardized basis \(\varvec{c}(u)=(c_{-1}(u), \ldots , c_{N}(u))^T\) is defined as
with
Next, define vectors \(\{\varvec{c}_{j}\}_{j=-1}^N\) for a realization of time series \(\varvec{y}\) by \(\varvec{c}_{j}\) = \((c_{j}(u_1),\) \(\ldots\), \(c_{j}(u_n))^{T}\), and a \(n\times (N+2)\) matrix \({\varvec{C}}=(\varvec{c}_{-1}, \ldots , \varvec{c}_N)\). The B-spline estimator \({\hat{g}}\) is a linear combination of \(\varvec{c}_{j}\)
where the coefficients \(\hat{\varvec{\beta }}=({\hat{\beta }}_{-1}, \ldots , {\hat{\beta }}_N)^T\) minimizes
Thus from the classic theory of the least squares we can obtain for \(u \in [0, 1]\),
More detailed descriptions about B-splines and their applications in Statistics can be found, for example, in [19] and [20].
4 Construction of Confidence Band
A simultaneous confidence band is constructed from the residuals of the trend estimate in (6). There are two major steps: in the first step, \({\hat{g}}\) is calculated by (6), and residuals \(\{{\hat{x}}_{t}\}_{t=1}^{n}\) are calculated from \({\hat{x}}_{t}=y_{t}-{\hat{g}}(u_t)\); in the second step, the parameters and estimates needed for the confidence band are obtained based on the matrix \(\varvec{C}\) and \(\{{\hat{x}}_{t}\}_{t=1}^{n}\), and thus a \((1-\alpha )\%\) SCB is constructed. For an integer \(m\ge 0\) and \(\nu \in \left[ 0,1\right]\), denote by \(C^{\left( m,\nu \right) }\left[ 0,1\right]\) the space of functions whose m-th derivatives satisfy Hölder conditions of order \(\nu\), i.e.,
The following assumptions are necessary for the proposed \((1-\alpha )\%\) SCB for a smooth trend function g(u).
-
1.
The trend function \(g(\cdot )\in C^{\left( m,\nu \right) }\left[ 0,1\right]\), \(m<2\).
-
2.
The number of interior knots N satisfies \(n^{1/4\left( m+\nu \right) }\ll N\ll n^{1/2}\).
-
3.
The parameter space \(\Xi\) is compact and consists of \(\varvec{\alpha }\) such that all roots of \(\Phi ({\varvec{\alpha }},z)=0\) and \(\Theta ({\varvec{\alpha }},z)=0\) are larger than one in absolute value, and they have no common roots. The true parameter value \({{{\varvec{\alpha }}_{0}}}\) is in the interior of the parameter space \(\Xi\).
-
4.
The innovations \(\left\{ \epsilon _{t}\right\} _{t=-\infty }^{\infty }\) are i.i.d. with \(\hbox {E}(\epsilon _{1}^{4})<\infty\).
These assumptions ensure that some uniform consist properties of \({\hat{g}}\) as an estimate of g and of the residuals \({\hat{x}}_t\) as a substitute for unobservable ARMA time series \(x_t\). According to [21], under assumptions 1–2, it is straightforward to obtain
according to [22], under assumption 3, a ARMA model is causal and invertible; under assumptions 1–4, the asymptotic properties of the maximum likelihood estimators for ARMA coefficients are not altered using the residuals \(\{{\hat{x}}_{t}\}_{t=1}^{n}\)
as a substitute.
One of the parameters needed in the confidence band is the zero spectrum \(\tau ^2\) of the ARMA series \(\{x_{t}\}_{t=1}^{n}\):
where \(\gamma _{k}\) is the autocorrelation at lag k, \(\gamma _{k}=\textsf{Cov}(x_t, x_{t+k})\). One can obtain the estimate \({\hat{\gamma }} _{k}\) of autoregressive covariance at any lag k by the sample autocovariance based on \(\{{\hat{x}}_t\}_{t=1}^n\) as follows
However, it requires a large sample size. For the COVID-19 data to be analyzed, \(n=730\), and the method introduce below works better.
Another method to obtain \({\hat{\gamma }}_{k}\) is based on the residual sequence \(\{{\hat{x}}_{t}\}_{t=1}^{n}\). In particular, \(\{{\hat{x}}_{t}\}_{t=1}^{n}\) is treated as the replacement of unobservable time series \(\{x_{t}\}_{t=1}^{n}\), and the maximum likelihood estimates \(\hat{\varvec{\phi }}=({\hat{\phi }}_{1},\ldots , {\hat{\phi }}_{p})^T\) is calculated for the autoregressive coefficients and \(\hat{\varvec{\theta }}=({\hat{\theta }}_{1}, \ldots , {\hat{\theta }}_{q})^{T}\) for the moving-average coefficients. Since these estimators \(\left\{ \hat{\varvec{\phi }}, \hat{\varvec{\theta }}\right\}\) are consistent according to [21], one can obtain the estimate \({\hat{\gamma }}_{k}\) from the ARMA model:
where \(\{{\hat{\epsilon }}_t\}_{t=1}^n\) is the residuals of the ARMA(p, q) time series model. The theoretical autocorrelation of model (9) with known coefficients \(\left\{ \hat{\varvec{\phi }}, \hat{\varvec{\theta }}\right\}\) can be calculated using the algorithms in [22]. In addition, this procedure to obtain \({\hat{\tau }}^2\) can be easily carried out by the functions \(\textsf{arima}\) for estimates \(\left\{ \hat{\varvec{\phi }}, \hat{\varvec{\theta }}\right\}\) and \(\textsf{ARMAacf}\) for \({\hat{\gamma }}_{k}\) of the \(\textsf{R}\) which is a language and environment for statistical computing [23]. The procedure exhibits a satisfactory performance in the simulation studies later in this section, and will be applied to the data analysis in the next section. There are some other approaches to estimate \(\gamma _k\) and thus \(\tau ^2\), for example [24].
Let the matrix \(\varvec{S}\) be the inverse of the \((N+2)\times (N+2)\) band matrix \(\varvec{V}\) defined below:
In particular, the \(2\times 2\) submatrix \({\varvec{S}}_{j}\) of \({\varvec{S}}\) is
Define
and
with
where \(w_{j(u)}\) is defined in (5). It can be shown that under Assumptions 1–4,
which implies that the \(100(1-\alpha )\%\) SCB is constructed as follows: for any \(0\le u\le 1\)
The proof of (11) is a generalization of the proof of Theorem 2.1 of [18] for the autoregressive time series based on Theorem 4 of [21], and the details are omitted.
Simulation studies are performed for \(\{y_t\}_{t=1}^n\) generated from model (1) with constant function \(g=1\) and ten ARMA models: nine ARMA(1,1) and one ARMA(2,2). These models with coefficients in Table 2 are chosen so that they are representatives of ARMA(1,1) with different shapes of autocorrelation functions (ACF). Figure 3 illustrates ACF’s of some of the models. The coefficients of ARMA(2,2) are obtained from the COVID-19 data to be analyzed in the next section—some of the autocorrelation functions are either non-negative or non-positive at any lags, and the others are alternate.
Five hundred sample paths are simulated for each of the three sample sizes \(n=400\), 800, 2000. Two numbers of knots, \(N=n^{0.2}\) and \(N=n^{0.3}\), are chosen for the B-Spline fit. The 95% confidence bands are constructed from (12). From the relative frequencies of the bands that contains g in Tables 2 and 3, the smaller number of knots tends to outperform the larger one. The bands based on \(N=n^{0.2}\) of knots are conservative, in that all the empirical rates are greater than the nominal rate 95%. On the other hand, some of empirical rates based on \(N=n^{0.3}\) appear to be smaller than 95% at \(n=400\), in particular, for the models, such as \(\phi _1=0.4\) and \(\theta _1=-0.8\), the ACF of which is always negative as shown in Fig. 3.
5 COVID-19 Data Analysis
To estimate the trend g, the piecewise linear B-splines is applied to the differenced log case counts \(\{y_t\}_{t=1}^{730}\), and \(N=n^{0.2}\) is used. The estimates \(\{{\hat{g}}(u_t)\}_{t=1}^{730}\) are calculated, which is illustrated by the solid curve in Fig. 1, and the residuals \(\{{\hat{x}}_t\}_{t=1}^{730}\) of the B-splines are obtained. The ARMA model selection follows an iterative procedure recommended by [25] using the criterion Akaike Information Criterion (AIC). In particular, the AIC is calculated for a model with tentative values for the autoregressive order p and the moving-average order q, and then compared with the AIC of the previous chosen model. According to AIC, the ARMA(2, 2) is selected for the residuals \(\{{\hat{x}}_t\}_{t=1}^{730}\) with the estimates and their standard errors being \({\hat{\phi }}_1=0.78\pm 0.12\), \({\hat{\phi }}_2=-0.70\pm 0.10\), \({\hat{\theta }}_1=-0.76\pm 0.10\), \({\hat{\theta }}_2=0.81\pm 0.07\). The sample autocorrelations of the ARMA(2, 2) residuals \(\{{\hat{\epsilon }}_t\}_{t=1}^{730}\) at lags 1–20 are reasonably small and within the 95% confidence intervals in Fig. 4, which indicates that ARMA(2, 2) is a good fit.
In Fig. 1, the dashed curves are the lower and upper bounds of the 95% SCB. The differenced log populations of the male seniors and female seniors, \(y=-0.227\), is taken into account. The age adjustment line \(y=-0.227\) is not completely contained in the 95% SCB, such as those from November 2020 to March 2021. The percentage of the number of time points at which \(y=-0.227\) is below the lower bound is as high as \(49.3\%\), almost half of the times in the study period. The male case count tends to be proportionally greater than the female counterpart.
6 Discussion
A \(95\%\) SCB is proposed to test whether the case counts of senior groups of both genders are significantly different after the surveillance COVID-19 data of the State of Ohio are adjusted according to their population sizes. It is a powerful statistical inference tool that gives a simultaneous band for the trend function of an ARMA time series. The 95% SCB which is constructed based on the B-splines does not contain the constant baseline \(y=-\,0.227\)—sometimes the baseline is above or below the lower bound, which implies that gender is a factor that has a significant impact on the case counts of senior groups of both genders. Although this result coincides with some observation at the beginning of pandemic outbreak, for example [26] and [27], the SCB approach takes nonconstant trend function and correlation into account when considering the data over two years. The biological or medical aspects for such gender imbalance should be further investigated [28, 29].
The data analysis results are valuable to help reduce or even prevent illness from a similar pandemic in the future. However, this research has some limitations. First, the population size adjustment is calculated based on the records in 2019–2020, not in the period of the study. When the more precise population sizes are available, the study should be updated. Secondly, the development of technology makes surveillance data so easily available, but a great deal of work needs to be done to improve the accuracy [30, 31]). The scientific value of a study relies on the quality of its data. Thirdly, the publicly available COVID-19 data overall are very limited. More comprehensive analysis could reveal more useful information to the general public and policy makers when more data on the patients, such as comorbidities, were available.
References
Kim M, Gu Z, Yu S (2021) Methods, challenges, and practical issues of COVID-19 projection: a data science perspective. J Data Sci 19:219–242
Charvadeh YK, Yi GY, Bian Y et al (2022) Is 14-days a sensible quarantine length for COVID-19? Examinations of some associated issues with a case study of COVID-19 incubation times. Stat Biosci 14:175–190
Gaythorpe KAM, Bhatia S, Mangal T et al (2021) Childrens role in the COVID-19 pandemic: a systematic review of early surveillance data on susceptibility, severity, and transmissibility. Sci Rep 11:13903
Nogueira PJ, de Araújo Nobre M, Costa A et al (2020) The role of health preconditions on COVID-19 deaths in Portugal: evidence from surveillance data of the first 20293 infection cases. J Clin Med 9(8):2368
Ullrich A, Schranz M, Rexroth U et al (2021) Impact of the COVID-19 pandemic and associated non-pharmaceutical interventions on other notifiable infectious diseases in Germany: an analysis of national surveillance data during week 1–2016-week 32–2020. Lancet Reg Health Europe 6:100103
Dai M, Liu D, Liu M et al (2020) Patients with cancer appear more vulnerable to SARS-CoV-2: a multicenter study during the COVID-19 outbreak. Cancer Discov 10:783–791
Mi J, Zhong W, Huang C, Zhang W, Tan L, Ding L (2020) Gender, age and comorbidities as the main prognostic factors in patients with COVID-19 pneumonia. Am J Transl Res 12:6537–6548
Mukherjee S, Pahan K (2021) Is COVID-19 gender-sensitive? J Neuroimmune Pharmacol 16:38–47
Qian J, Zhao L, Ye R et al (2020) Age-dependent gender differences in COVID-19 in Mainland China: comparative study. Clin Infect Dis 71:2488–2494
Nguyen H, Shao Q (2021) Risk identification and prediction for COVID-19 mortality. Transl Univ Toledo J Med Sci 9:39–49
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley Interscience, New Jersey
Liu G, Shao Q, Woody J, Lund R (2016) Testing seasonal means in time series data. Environmetrics 27:198–211
Huang JZ (2003) Local asymptotics for polynomial spline regression. Ann Stat 31:1600–1635
Stone CJ, Hansen M, Kooperberg C, Truong Y (1997) Polynomial splines and their tensor products in extended linear modeling (with discussion). Ann Stat 25:1371–1470
Wang L, Yang L (2009) Spline estimation of single-index models. Stat Sin 19:765–783
Xue L, Yang L (2006) Additive coefficient modelling via polynomial spline. Stat Sin 16:1423–1446
Wang J, Yang L (2009) Polynomial spline confidence bands for regression curves. Stat Sin 19:325–342
Shao Q, Yang L (2012) Polynomial spline confidence bands for time series trend. J Stat Plan Inference 142:1678–1689
de Boor C (2001) A practical guide to splines. Springer-Verlag, New York
DeVore RA, Lorentz GG (1993) Constructive approximation: polynomials and splines approximation. Springer-Verlag, Berlin
Shao Q, Yang L (2017) Oracally efficient estimation and consistent model selection for auto-regressive moving average time series with trend. J R Stat Soc (Ser B) 79:507–524
Brockwell PJ, Davis RA (1991) The analysis of time series: theory and methods, 2nd edn. Springer, New York
R Core Team (2022) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org/
Hall P, Keilegom IV (2003) Using difference-based methods for inference in nonparametric regression with time series errors. J R Stat Soc (Ser B) 65:443–456
Box GEP, Jenkins GM, Reinsel GC (1994) Time series analysis: forecasting and control, 3rd edn. Prentice-Hall International, New Jersey
Shao Q, Thompson G, Thompson A (2020) COVID-19 risk factor identification based on Ohio data. Transl Univ Toledo J Med Sci 8:6–14
Zheng S, Qin J, Zhou Y (2021) Assessment of effects of age and gender on the incubation period of COVID-19 with a mixture regression model. J Data Sci 19:253–268
Dangis A, De Brucker N, Heremans A, Gillis M, Frans J, Demeyere A, Symons R (2020) Impact of gender on extent of lung injury in COVID-19. Clin Radiol 75(7):554–556
Giacomelli A, De Falco T, Oreni L et al (2021) Impact of gender on patients hospitalized for SARS-COV-2 infection: a prospective observational study. J Med Virol 93:4597–4602
Alamo T, Reina DG, Mammarella M, Abella A (2020) Covid-19: open-data resources for monitoring, modeling, and forecasting the epidemic. Electronics 9:827
Tarantola D, Dasgupta N (2021) COVID-19 surveillance data: a primer for epidemiology and data science. Am J Public Health 111:614–619
Acknowledgements
The author would like to thank referees for careful reading and constructive suggestions.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The author does not have any financial interests that are directly or indirectly related to the work submitted for publication.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shao, Q. Simultaneous Confidence Band Approach for Comparison of COVID-19 Case Counts. Stat Biosci 15, 372–383 (2023). https://doi.org/10.1007/s12561-023-09364-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s12561-023-09364-y