1 Introduction

The outbreak of the coronavirus disease (COVID-19) in late 2019 has profoundly changed all walks of life and has raised many challenging research questions over the development of the pandemic. It is the first such pandemic outbreak since the 1918 influenza, commonly known by the misnomer “Spanish flu”. It is the first time in human history that a huge amount of data related to a pandemic has been recorded, which makes it possible to analyze the biological, medical and social aspects of COVID-19, and which provides statisticians and data scientists with tremendous yet challenging opportunities for COVID-19 studies [1].

There have been some attempts to obtain valuable information for society to help identify risk factors and gain an understanding of the coronavirus, in spite of there being so far a limited amount of publicly available data. The primary goal of the paper is to propose an appropriate statistical learning method, using publicly available COVID-19 data, to obtain information about how the coronavirus affects male and female seniors differently.

Many aspects of this virus have been studied based on publicly accessible data which are updated daily from all over the world [2,3,4,5]. In particular, the effects of gender and age on COVID-19 have received a great deal of attention. Senior people tend to have many comorbidities which are associated with most of the senior deaths [6,7,8,9], and the coronavirus tends to exacerbate preexisting health problems. In addition, gender seems to play a role too, and male tends to be more vulnerable. The paper [10] presented the age and gender effects on the fatality rate using the COVID-19 data of the State of Ohio. The authors applied logistic regression to identify factors on the death probability of a COVID-19 patient and concluded that age and gender were significant factors. One of the critical assumptions for logistic regression is that the log odds ratio can be modeled by a linear function [11]. However, many unexpected events, such as vaccination that began in early 2021 and the delta variant that appeared in June–July 2021, could have changed the data generating mechanism. In other words, the linear assumption of logistic regression is too restrictive.

This study focuses on the senior case count comparison between April 1, 2020 and March 31, 2022, inclusive, and aims at investigating whether the impact of male and female on the case count is significant by modeling the difference of log transformed case counts of the two genders \(\{y_{t}\}_{t=1}^n\) after population size adjustment. The paper [12] proposed a correction in the analysis of variance test for autoregressive moving-average time series (ARMA). However, the time series \(\{y_{t}\}_{t=1}^n\) are not only correlated due to the nature of a contagious disease, but also contain a nonlinear trend due to several unexpected events, such as vaccinations and the appearance of the delta variant. The dots in Fig. 1 are the observations of \(\{y_{t}\}_{t=1}^n\), which do not unveil any pattern of the trend or of the mean function that can be easily described by an analytical function. This observation about the data plotted in Fig. 1 suggests that a nonparametric approach, such as splines, is more appropriate. B-spline estimation has been widely applied to describe an unknown mean function [13,14,15,16]. In particular, [17] proposed a simultaneous confidence band (SCB) approach for the mean function based on B-spline estimation, and [18] extended it to autoregressive time series with unknown nonlinear trend function.

Fig. 1
figure 1

Scatter plot for \(\{y_t\}_{t=1}^{730}\) and 95% SCB

The paper extends the confidence band approach of AR to a more general ARMA setting, and applies it to the COVID-19 senior daily case count comparison of males and females in the State of Ohio. The paper is organized as follows. In Sect. 2, the details of the data set to be analyzed are provided; in Sect. 3, the B-splines trend function estimation is reviewed; in Sect. 4, the construction of confidence bands is described, and some simulation studies are presented; in Sect. 5, the confidence band approach is applied to the senior COVID-19 data; in Sect. 6, the paper concludes with some discussion.

2 Data

The State of Ohio established a websiteFootnote 1 which provides the public with information and data about COVID-19 in the state. The daily case count data of male and female seniors (at least 60 years old) from April 1, 2020 to March 31, 2022, inclusive, were downloaded from the website. There are \(n=730\) daily case counts for each group, and there is no missing data or zero count during this period. According to the summary of the statistics in Table 1, the female case count has larger values in all major statistics, such as mean and median. This case discrepancy might be resulted in by the different of the populations of these two groups—the male senior population is smaller than the female counterpart in Ohio. Define a time series \(\{y_t\}_{t=1}^{n}\) as the differenced log transformed daily case counts of the two groups: male senior COVID-19 patients and female senior COVID-19 patients. The scatter plot in Fig. 1 shows that \(\{y_t\}_{t=1}^{n}\) exhibits some nonlinear pattern, and tends to be negative, or the male senior group seems to have a lower case count than the female counterpart.

Table 1 Summary of statistics for case counts

The population difference is taken into account and calculated according to the information in the State of Ohio in 2019–2020.Footnote 2 The horizontal dashed line in Fig. 1 is \(y=-\,0.227\), which is the differenced log male and female senior populations. Most of the observations \(\{y_t\}_{t=1}^{n}\) are above the line \(y=-\,0.227\), which suggests that the male senior group has a higher case rate or are more vulnerable to COVID-19. However, in Fig. 1, the case count of the male senior group outnumbered the case count of the female senior group at the first peak which occurred in the beginning of the pandemic on April 16, 2020, but the opposite situation happened at the second large peak more than a year later on June 19, 2021. Therefore, it is not obvious which gender group is more vulnerable, and the confidence band approach proposed in this paper will provide rigorous statistical evidence to this end.

3 B-spline Trend Estimation

The proposed confidence band is constructed based on a B-spline estimator for the mean function or trend g. B-spline estimation for functions is one of the nonparametric methods that are widely applied. We start with the B-spline estimation and provide the details about how to construct a \((1-\alpha )\%\) simultaneous confidence band.

A realization \(\{y_t\}_{t=1}^{n}\) of a time series is assumed to follow the model below:

$$\begin{aligned} y_t=g(u_t)+x_t, \end{aligned}$$
(1)

where \(g(u_t)\) is a smooth trend function with \(u_t=t/n\) and \(\{x_t\}\) is an autoregressive and moving-average time series with orders p and q (ARMA(p, q)) with mean \(\textsf{E}(x_t)=0\) and covariance \(\gamma _k=\textsf{Cov}(x_t, x_{t-k})\)

$$\begin{aligned} x_t-\sum _{k=1}^p\phi _kx_{t-k}=\epsilon _t+\sum _{k=1}^q\theta _{k}\epsilon _{t-k}. \end{aligned}$$
(2)

In the ARMA model (2) above, \(\{\epsilon _t\}\) is a white noise sequence; that is, mean \(\textsf{E}(\epsilon _t)=0\) and variance \(\textsf{Cov}(\epsilon _t, \epsilon _{t-k})=\sigma ^2\) if \(k=0\) and 0 otherwise. Hereafter, bold letters are used for vectors. The default of a vector is column, and a superscript T is used sometimes for the transpose operation of a matrix. For example, \(\varvec{y}=(y_{1}, \ldots , y_{n})^{T}\) is a vector that is corresponding to all the observations.

This paper considers the case that the unknown trend function g is estimated by degree one B-splines or piecewise linear B-splines for two reasons: a higher degree B-spline estimator does not provide much more gain; a rigorous proof for a confidence band based on the piecewise linear B-splines can be obtained. Let N be an integer, and let \([-h, 1]\) be divided into \((N+2)\) equally spaced subintervals \(J_j=\left[ jh, (j+1)h\right) , j=-1, 0, \ldots , N-1\) and \(J_N=\left[ Nh, 1\right]\), of width \(h=(N+1)^{-1}\). The integer N is a tuning parameter.

The basis of a piecewise linear spline space is \(\varvec{b}(u)=(b_{-1}(u), \ldots , b_{N}(u))^T\) with \(b_{j}(u)\) being defined as follows:

$$\begin{aligned} b_{j}(u)={\left\{ \begin{array}{ll} u/h-j(u), &{}j=j(u), \\ j(u)+1-u/h, &{} j=j(u)-1, \\ 0, &{} \hbox { Otherwise}, \end{array}\right. } \end{aligned},$$
(3)

where j(u) is the location index of u for \(u\in J_{j(u)}\). Any \(b_{-1}(u)\) is not zero in at most two adjacent subintervals. Figure 2 is an example of the degree one B-splines \(\{b_{j,1}(u)\}_{j=-1}^3\), where \(h=0.25\) and \(N=3\). The standardized basis \(\varvec{c}(u)=(c_{-1}(u), \ldots , c_{N}(u))^T\) is defined as

$$\begin{aligned} c_{j}(u) =\frac{b_{j}(u)}{\left\{ \int _0^1b_{j}^2(u)\mathrm{{d}}u\right\} ^{1/2}}= {\left\{ \begin{array}{ll} \frac{u/h-j}{w_{j}\sqrt{h/3}}, &{}j=j(u),\\ \frac{j+2-u/h}{w_{j}\sqrt{h/3}},&{} j=j(u)-1, \\ 0, &{}\text { Otherwise } \end{array}\right. } \end{aligned}$$
(4)

with

$$\begin{aligned} w_j={\left\{ \begin{array}{ll} \sqrt{2}, &{} 0\le j\le N-1, \\ 1, &{} j=-1, N. \end{array}\right. } \end{aligned}$$
(5)

Next, define vectors \(\{\varvec{c}_{j}\}_{j=-1}^N\) for a realization of time series \(\varvec{y}\) by \(\varvec{c}_{j}\) = \((c_{j}(u_1),\) \(\ldots\), \(c_{j}(u_n))^{T}\), and a \(n\times (N+2)\) matrix \({\varvec{C}}=(\varvec{c}_{-1}, \ldots , \varvec{c}_N)\). The B-spline estimator \({\hat{g}}\) is a linear combination of \(\varvec{c}_{j}\)

$$\begin{aligned} {\hat{g}}=\varvec{C}\hat{\varvec{\beta }}, \end{aligned},$$

where the coefficients \(\hat{\varvec{\beta }}=({\hat{\beta }}_{-1}, \ldots , {\hat{\beta }}_N)^T\) minimizes

$$\begin{aligned} (\varvec{y}-\varvec{C}\varvec{\beta })^T(\varvec{y}-\varvec{C}\varvec{\beta }). \end{aligned}$$

Thus from the classic theory of the least squares we can obtain for \(u \in [0, 1]\),

$$\begin{aligned} {\hat{g}}(u)= & {} \varvec{c}^{T }(u) \left( \frac{1}{n}{\varvec{C} }^{T }{\varvec{C}}\right) ^{-1}\left( \frac{1}{n}{\varvec{C}} ^{T }\varvec{y}\right) . \end{aligned}$$
(6)

More detailed descriptions about B-splines and their applications in Statistics can be found, for example, in [19] and [20].

Fig. 2
figure 2

Degree one B-splines \(\{b_{j}(u)\}_{j=-1}^3\)

4 Construction of Confidence Band

A simultaneous confidence band is constructed from the residuals of the trend estimate in (6). There are two major steps: in the first step, \({\hat{g}}\) is calculated by (6), and residuals \(\{{\hat{x}}_{t}\}_{t=1}^{n}\) are calculated from \({\hat{x}}_{t}=y_{t}-{\hat{g}}(u_t)\); in the second step, the parameters and estimates needed for the confidence band are obtained based on the matrix \(\varvec{C}\) and \(\{{\hat{x}}_{t}\}_{t=1}^{n}\), and thus a \((1-\alpha )\%\) SCB is constructed. For an integer \(m\ge 0\) and \(\nu \in \left[ 0,1\right]\), denote by \(C^{\left( m,\nu \right) }\left[ 0,1\right]\) the space of functions whose m-th derivatives satisfy Hölder conditions of order \(\nu\), i.e.,

$$\begin{aligned} C^{\left( m,\nu \right) }\left[ 0,1\right] =\left\{ \phi :\left[ 0,1 \right] \mathbb {\rightarrow R}\left| \left\| \phi \right\| _{m,\nu }=\sup _{0\le x<y\le 1}\frac{\left| \phi ^{\left( m\right) }\left( x\right) -\phi ^{\left( m\right) }\left( y\right) \right| }{\left| x-y\right| ^{\nu }}<+\infty \right. \right\} . \end{aligned}$$
(7)

The following assumptions are necessary for the proposed \((1-\alpha )\%\) SCB for a smooth trend function g(u).

  1. 1.

    The trend function \(g(\cdot )\in C^{\left( m,\nu \right) }\left[ 0,1\right]\), \(m<2\).

  2. 2.

    The number of interior knots N satisfies \(n^{1/4\left( m+\nu \right) }\ll N\ll n^{1/2}\).

  3. 3.

    The parameter space \(\Xi\) is compact and consists of \(\varvec{\alpha }\) such that all roots of \(\Phi ({\varvec{\alpha }},z)=0\) and \(\Theta ({\varvec{\alpha }},z)=0\) are larger than one in absolute value, and they have no common roots. The true parameter value \({{{\varvec{\alpha }}_{0}}}\) is in the interior of the parameter space \(\Xi\).

  4. 4.

    The innovations \(\left\{ \epsilon _{t}\right\} _{t=-\infty }^{\infty }\) are i.i.d. with \(\hbox {E}(\epsilon _{1}^{4})<\infty\).

These assumptions ensure that some uniform consist properties of \({\hat{g}}\) as an estimate of g and of the residuals \({\hat{x}}_t\) as a substitute for unobservable ARMA time series \(x_t\). According to [21], under assumptions 1–2, it is straightforward to obtain

$$\begin{aligned} \max _{1\le t\le n}\hbox {E}\left\{ g(u_{t})-{\hat{g}}(u_{t})\right\} ^{2}= & {} o\left( n^{-1/2}\right) ; \end{aligned}$$
(8)

according to [22], under assumption 3, a ARMA model is causal and invertible; under assumptions 1–4, the asymptotic properties of the maximum likelihood estimators for ARMA coefficients are not altered using the residuals \(\{{\hat{x}}_{t}\}_{t=1}^{n}\)

$$\begin{aligned} {\hat{x}}_t=y_t-{\hat{g}}(u_t) \end{aligned}$$

as a substitute.

One of the parameters needed in the confidence band is the zero spectrum \(\tau ^2\) of the ARMA series \(\{x_{t}\}_{t=1}^{n}\):

$$\begin{aligned} \tau ^2=\gamma _0+2\sum _{k\ge 1}\gamma _{k}, \end{aligned},$$

where \(\gamma _{k}\) is the autocorrelation at lag k, \(\gamma _{k}=\textsf{Cov}(x_t, x_{t+k})\). One can obtain the estimate \({\hat{\gamma }} _{k}\) of autoregressive covariance at any lag k by the sample autocovariance based on \(\{{\hat{x}}_t\}_{t=1}^n\) as follows

$$\begin{aligned} {\hat{\gamma }}_{k}=\frac{1}{n}\sum _{t=k+1}^n{\hat{x}}_{t}{\hat{x}}_{t-k}. \end{aligned}$$

However, it requires a large sample size. For the COVID-19 data to be analyzed, \(n=730\), and the method introduce below works better.

Another method to obtain \({\hat{\gamma }}_{k}\) is based on the residual sequence \(\{{\hat{x}}_{t}\}_{t=1}^{n}\). In particular, \(\{{\hat{x}}_{t}\}_{t=1}^{n}\) is treated as the replacement of unobservable time series \(\{x_{t}\}_{t=1}^{n}\), and the maximum likelihood estimates \(\hat{\varvec{\phi }}=({\hat{\phi }}_{1},\ldots , {\hat{\phi }}_{p})^T\) is calculated for the autoregressive coefficients and \(\hat{\varvec{\theta }}=({\hat{\theta }}_{1}, \ldots , {\hat{\theta }}_{q})^{T}\) for the moving-average coefficients. Since these estimators \(\left\{ \hat{\varvec{\phi }}, \hat{\varvec{\theta }}\right\}\) are consistent according to [21], one can obtain the estimate \({\hat{\gamma }}_{k}\) from the ARMA model:

$$\begin{aligned} {\hat{x}}_t-\sum _{k=1}^p{\hat{\phi }}_k{\hat{x}}_{t-k}={\hat{\epsilon }}_t+\sum _{k=1}^q{\hat{\theta }}_{k}{\hat{\epsilon }}_{t-k}, \end{aligned},$$
(9)

where \(\{{\hat{\epsilon }}_t\}_{t=1}^n\) is the residuals of the ARMA(p, q) time series model. The theoretical autocorrelation of model (9) with known coefficients \(\left\{ \hat{\varvec{\phi }}, \hat{\varvec{\theta }}\right\}\) can be calculated using the algorithms in [22]. In addition, this procedure to obtain \({\hat{\tau }}^2\) can be easily carried out by the functions \(\textsf{arima}\) for estimates \(\left\{ \hat{\varvec{\phi }}, \hat{\varvec{\theta }}\right\}\) and \(\textsf{ARMAacf}\) for \({\hat{\gamma }}_{k}\) of the \(\textsf{R}\) which is a language and environment for statistical computing [23]. The procedure exhibits a satisfactory performance in the simulation studies later in this section, and will be applied to the data analysis in the next section. There are some other approaches to estimate \(\gamma _k\) and thus \(\tau ^2\), for example [24].

Let the matrix \(\varvec{S}\) be the inverse of the \((N+2)\times (N+2)\) band matrix \(\varvec{V}\) defined below:

$$\begin{aligned} (\varvec{V})_{j, j'=-1}^N= {\left\{ \begin{array}{ll} 1, &{} j=j'; \\ 1/4, &{} |j-j'|=1 \hbox { and }0\le j, j'\le N-1;\\ \sqrt{2}/4, &{} |j-j'|=1 \hbox { and } j \hbox { or }j'=-1, N;\\ 0, &{} \hbox {Otherwise}. \end{array}\right. } \end{aligned}$$
(10)

In particular, the \(2\times 2\) submatrix \({\varvec{S}}_{j}\) of \({\varvec{S}}\) is

$$\begin{aligned} {\varvec{S}}_{j}=\left( \begin{matrix} s_{j-1,j-1} &{} s_{j-1,j} \\ s_{j,j-1} &{} s_{j,j} \end{matrix} \right) . \end{aligned}$$

Define

$$\begin{aligned} d_{\alpha }=1-\left\{ 2\log (N+1)\right\} ^{-1}\left[ \log (\alpha /2)+ \frac{1}{2}\left\{ \log \log (N+1)+\log (4\pi )\right\} \right] , \end{aligned}$$

and

$$\begin{aligned} {\hat{\eta }}_{n}^{2}(u)=\frac{3}{nh}\left( \sum _{|k|=0}^{\lceil nh\rceil } {\hat{\gamma }}_{k}\right) \varvec{\delta }^{T}(u){\varvec{S}}_{j(u)} \varvec{\delta }(u), \end{aligned}$$

with

$$\begin{aligned} \varvec{\delta }(u)=\left( \frac{j(u)+1-u/h}{w_{j(u)-1}},\frac{u/h-j(u)}{ w_{j(u)}}\right)\end{aligned},$$

where \(w_{j(u)}\) is defined in (5). It can be shown that under Assumptions 1–4,

$$\begin{aligned} \liminf _{n\rightarrow \infty }\textsf{P}\left\{ g(u)\in {\hat{g}}(u)\pm 2\{\log (N+1)\}^{1/2}{\hat{\eta }}_{n}(u)d_{\alpha /2}\right\} \ge 1-\alpha , \end{aligned}$$
(11)

which implies that the \(100(1-\alpha )\%\) SCB is constructed as follows: for any \(0\le u\le 1\)

$$\begin{aligned} {\hat{g}}(u)\pm 2\{\log (N+1)\}^{1/2}{\hat{\eta }}_{n}(u)d_{\alpha /2}. \end{aligned}$$
(12)

The proof of (11) is a generalization of the proof of Theorem 2.1 of [18] for the autoregressive time series based on Theorem 4 of [21], and the details are omitted.

Simulation studies are performed for \(\{y_t\}_{t=1}^n\) generated from model (1) with constant function \(g=1\) and ten ARMA models: nine ARMA(1,1) and one ARMA(2,2). These models with coefficients in Table 2 are chosen so that they are representatives of ARMA(1,1) with different shapes of autocorrelation functions (ACF). Figure 3 illustrates ACF’s of some of the models. The coefficients of ARMA(2,2) are obtained from the COVID-19 data to be analyzed in the next section—some of the autocorrelation functions are either non-negative or non-positive at any lags, and the others are alternate.

Table 2 Simultaneous confidence band relative frequencies for ARMA(1,1)

Five hundred sample paths are simulated for each of the three sample sizes \(n=400\), 800, 2000. Two numbers of knots, \(N=n^{0.2}\) and \(N=n^{0.3}\), are chosen for the B-Spline fit. The 95% confidence bands are constructed from (12). From the relative frequencies of the bands that contains g in Tables 2 and 3, the smaller number of knots tends to outperform the larger one. The bands based on \(N=n^{0.2}\) of knots are conservative, in that all the empirical rates are greater than the nominal rate 95%. On the other hand, some of empirical rates based on \(N=n^{0.3}\) appear to be smaller than 95% at \(n=400\), in particular, for the models, such as \(\phi _1=0.4\) and \(\theta _1=-0.8\), the ACF of which is always negative as shown in Fig. 3.

Fig. 3
figure 3

Autocorrelation functions of some ARMA(1,1) models

Table 3 Simultaneous confidence band relative frequencies for ARMA(2,2)

5 COVID-19 Data Analysis

To estimate the trend g, the piecewise linear B-splines is applied to the differenced log case counts \(\{y_t\}_{t=1}^{730}\), and \(N=n^{0.2}\) is used. The estimates \(\{{\hat{g}}(u_t)\}_{t=1}^{730}\) are calculated, which is illustrated by the solid curve in Fig. 1, and the residuals \(\{{\hat{x}}_t\}_{t=1}^{730}\) of the B-splines are obtained. The ARMA model selection follows an iterative procedure recommended by [25] using the criterion Akaike Information Criterion (AIC). In particular, the AIC is calculated for a model with tentative values for the autoregressive order p and the moving-average order q, and then compared with the AIC of the previous chosen model. According to AIC, the ARMA(2, 2) is selected for the residuals \(\{{\hat{x}}_t\}_{t=1}^{730}\) with the estimates and their standard errors being \({\hat{\phi }}_1=0.78\pm 0.12\), \({\hat{\phi }}_2=-0.70\pm 0.10\), \({\hat{\theta }}_1=-0.76\pm 0.10\), \({\hat{\theta }}_2=0.81\pm 0.07\). The sample autocorrelations of the ARMA(2, 2) residuals \(\{{\hat{\epsilon }}_t\}_{t=1}^{730}\) at lags 1–20 are reasonably small and within the 95% confidence intervals in Fig. 4, which indicates that ARMA(2, 2) is a good fit.

Fig. 4
figure 4

ARMA (2, 2) residual sample autocorrelation function

In Fig. 1, the dashed curves are the lower and upper bounds of the 95% SCB. The differenced log populations of the male seniors and female seniors, \(y=-0.227\), is taken into account. The age adjustment line \(y=-0.227\) is not completely contained in the 95% SCB, such as those from November 2020 to March 2021. The percentage of the number of time points at which \(y=-0.227\) is below the lower bound is as high as \(49.3\%\), almost half of the times in the study period. The male case count tends to be proportionally greater than the female counterpart.

6 Discussion

A \(95\%\) SCB is proposed to test whether the case counts of senior groups of both genders are significantly different after the surveillance COVID-19 data of the State of Ohio are adjusted according to their population sizes. It is a powerful statistical inference tool that gives a simultaneous band for the trend function of an ARMA time series. The 95% SCB which is constructed based on the B-splines does not contain the constant baseline \(y=-\,0.227\)—sometimes the baseline is above or below the lower bound, which implies that gender is a factor that has a significant impact on the case counts of senior groups of both genders. Although this result coincides with some observation at the beginning of pandemic outbreak, for example [26] and [27], the SCB approach takes nonconstant trend function and correlation into account when considering the data over two years. The biological or medical aspects for such gender imbalance should be further investigated [28, 29].

The data analysis results are valuable to help reduce or even prevent illness from a similar pandemic in the future. However, this research has some limitations. First, the population size adjustment is calculated based on the records in 2019–2020, not in the period of the study. When the more precise population sizes are available, the study should be updated. Secondly, the development of technology makes surveillance data so easily available, but a great deal of work needs to be done to improve the accuracy [30, 31]). The scientific value of a study relies on the quality of its data. Thirdly, the publicly available COVID-19 data overall are very limited. More comprehensive analysis could reveal more useful information to the general public and policy makers when more data on the patients, such as comorbidities, were available.