1 Introduction

The COVID-19 pandemic shock represented a once-in-a-generation challenge to both the global economy and business forecasting, and it continues to contribute to elevated uncertainty through the present day. In addition to the severity of the shock itself—the 31.2% annualized drop in GDP in Q2 2020 was the largest in data stretching back to 1947, and the 14.7% unemployment rate recorded in April 2020 was the highest in monthly data dating back to January 1948—the policy response to the shock upended a number of empirical regularities observed in pre-pandemic cycles. For example, via multiple rounds of fiscal support, real disposable personal income rose significantly during the pandemic recession, in contrast to the experience of prior recessions (see Fig. 1). Additionally, the Federal Reserve implemented significant monetary policy accommodation to support the economy during the pandemic, including lowering the federal funds rate to its effective lower bound and expanding the asset side of its balance sheet from $4 trillion prior to the pandemic to nearly $9 trillion by March 2022. The unprecedented volatility in the economy left businesses scrambling to adjust their operations, and business economists scrambling to recalibrate forecasts and understand the tremendous shifts in the data.

Fig. 1

Real disposable personal income (SAAR, Bil. Chained 2012$)

Ho (2021a, b) surveys various approaches to forecasting during COVID. During the episode, forecasters faced key dilemmas, including how to model the nature of the COVID shock (for example, as a period of high volatility versus a structural break in relationships between macroeconomic variables) and how to think about the persistence of the shock (for example, forecasters debated whether the subsequent recovery would be “U-shaped,” “V-shaped,” “L-shaped,” etc.). Research in this field is active and ongoing (Lenza and Primiceri 2020; Primiceri and Tambalotti 2020; Foroni et al. 2020; Ng 2021).

In this article, we take the perspective of an applied business economist. We perform a retrospective evaluation of some of the workhorse statistical models used by business forecasters to see which approaches were most resilient during the early stages of the pandemic shock. We find that projection-based approaches were more resilient to the pandemic shock than iteration-based forecasts in the cases we studied. We also find that the pandemic induced high variation in forecast performance among the models that incorporate macroeconomic data. The reliability of such models is sensitive to the extent to which the outcome variable in the forecaster’s industry is representative of broader economic trends.

Given the volatility in standard data during the pandemic, many economists and forecasters turned toward nonstandard, high-frequency data to glean insights about the economy (Ryssdal and Hollnhorst 2021; McCracken 2020). We find that simply incorporating alternative high-frequency data into standard models did not necessarily improve forecast performance; however, more research is needed to assess the extent to which these indicators improved business planning. Our results are in line with those of Schorfheide and Song (2021), who find mixed results when incorporating the data into their forecast models, with the baseline model performing poorly during the trough of the pandemic.

The remainder of the paper proceeds as follows. In the next section, we discuss the data and various forecasting approaches that we compare. Section 3 presents results of the assessment, and Sect. 4 concludes.

2 Data and models

2.1 Data

In our main experiment, we take the perspective of an economist in the auto industry and simulate real-time forecasting for monthly total light vehicle sales (including autos and light trucks) in the United States. Data are available at monthly frequency from the Bureau of Economic Analysis starting in 1967 and accessed via the Haver Analytics database.

To test the generalizability of the results, we repeat the experiment for an outcome variable in another industry which may have experienced different supply and demand conditions during the pandemic: industrial production of information processing and related equipment (IP-IPRE). The category includes computers and peripheral equipment along with office, photocopy, communication and other related equipment. Data are available at monthly frequency from the Federal Reserve Board of Governors starting in 1967, accessed via Haver Analytics.

Summary statistics for both series are given in Table 1.

Table 1 Summary statistics for light vehicle sales (thousands) and IP: Information Processing and Related Equipment (index, 2017 = 100), Jan. 1967–Jan. 2022

Figure 2 shows both series along with real GDP, with all three series rescaled so that 2018 equals 100. The figure shows that during the pandemic recession of 2020, both real GDP and IP-IPRE fell and then subsequently recovered, with a slightly more volatile recovery for IP-IPRE.

Fig. 2

Light Vehicle Sales, 1967–Present and Industrial Production: Information Processing and Related Equipment, 1967–Present

In contrast, the initial drop in light vehicle sales was much larger in percentage terms relative to the declines in GDP and IP-IPRE, consistent with consumers postponing big-ticket durable goods purchases during the recession. With the pandemic causing health concerns and shuttering many service businesses, along with pandemic fiscal support raising households’ disposable income, consumers shifted spending from services to goods beginning in summer 2020, and light vehicle sales experienced a sharper recovery compared to GDP. In the spring of 2021, semiconductor shortages and other supply chain bottlenecks weighed on automotive deliveries and production and sales fell again, with signs of a recovery starting in 2022. As a result, the auto sector’s recovery from the 2020 COVID recession has been markedly different from the recovery in broad GDP, which will have implications for the forecasting exercise described in the next section.

There are a couple of important caveats to our experiment when comparing against the real-world experience of a business forecaster. First, in real-time the forecaster is constrained by the economic data release schedule. For example, a business economist preparing a forecast at the end of March to support a sales planning meeting at the beginning of April will have readings for March’s consumer sentiment index, but consumer price index readings for March will not be available until the middle of April. In practice, many economists will use external forecasts or some statistical model to fill in this “jagged edge” of data. A second caveat is that the forecaster will also only have access to data as reported, while a number of series, such as nonfarm payrolls, can be revised significantly between the initial report and subsequent data releases.

In this article, we abstract from these data constraints, giving our business economist perfect foresight into the current month’s data (including future revisions), and thus an unfair knowledge advantage over a real-time forecaster. In the next section, we conduct a relative comparison of models that all benefit from this unfair advantage, rather than comparing the models against actual real-time forecasts published during the pandemic. Readers should note that the absolute forecasting performance of the models discussed below may be worse than depicted, due to these real-time data availability constraints.

2.2 Models

We consider a number of forecasting models commonly used in industry, described below. To conduct a real-time forecasting experiment, we use each model to generate out-of-sample forecasts for up to 12 months ahead, starting with sample data through December 2018 and making forecasts from January 2019 through December 2019. Moving forward in time, we then consider sample data through January 2019 and generate forecasts for February 2019 through January 2020, and so forth. At each step, we re-estimate parameters of the model and choose best-fit models according to relevant information criteria, so that a particular forecast model chosen in January 2019 may differ from the model chosen in December 2018. These forecasts are stored and evaluated against actual observed outcomes to compare relative forecast accuracy.
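The expanding-window procedure above can be sketched in a few lines. The following is a minimal illustration, not the paper's actual code: `rolling_origin_forecasts` and the no-change "model" are hypothetical names used to show the bookkeeping of re-fitting at each origin and storing forecast/actual pairs by horizon.

```python
import numpy as np

def rolling_origin_forecasts(y, first_origin, horizon, fit_and_forecast):
    """Expanding-window out-of-sample evaluation: at each origin t the model
    is re-fit on y[:t] (data through the origin month) and its forecasts for
    t+1 .. t+horizon are stored alongside the realized outcomes."""
    records = []  # tuples of (origin, h, forecast, actual)
    for t in range(first_origin, len(y)):
        fcst = fit_and_forecast(y[:t], horizon)  # model re-selected and re-fit each step
        for h in range(1, horizon + 1):
            if t + h - 1 < len(y):
                records.append((t, h, fcst[h - 1], y[t + h - 1]))
    return records

# toy "model" used only to illustrate the loop: a no-change (random-walk) forecast
naive = lambda history, H: np.full(H, history[-1])

y = np.cumsum(np.random.default_rng(0).normal(size=60))
records = rolling_origin_forecasts(y, first_origin=48, horizon=12, fit_and_forecast=naive)
```

Any of the models described below can be dropped in for `fit_and_forecast`, which keeps the evaluation loop identical across approaches.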

2.2.1 ARIMA model (ARIMA)

The first kind of model we consider is a univariate autoregressive integrated moving average (ARIMA) model. For each sample step, we select an appropriate model order using the algorithm outlined in Hyndman and Khandakar (2008) and compute out-of-sample forecasts by iterating forward.

2.2.2 Vector autoregression models (VAR)

The second model we consider is a vector autoregression (VAR) which relates each variable in the system to lags of itself as well as lags of the other variables in the model, allowing it to capture feedback loops and interdependence between the endogenous variables. Following Stock and Watson (2002), along with the relevant outcome variable (annualized monthly growth of either light vehicle sales or IP-IPRE), we include monthly CPI inflation and the monthly change in the 90-day U.S. Treasury bill rate. The lag order of the VAR was chosen by selecting the order \(p \in \left[ {1,10} \right]\) which generated the lowest Bayesian information criterion (BIC). As in the case of the ARIMA model, for each sample step we compute out-of-sample forecasts by iterating the VAR forward.

2.2.3 VAR model with diffusion indexes (VAR-DI)

We also consider an alternative VAR model which incorporates information from a large set of macroeconomic indicators using the methods outlined in Stock and Watson (2002), rather than including a limited number of selected indicators. We use principal components analysis to extract the first two dynamic factors from a set of 181 monthly macroeconomic predictors available from January 1980 through January 2022. Stock and Watson (2002) interpret these factors as diffusion indexes measuring common movements in macroeconomic variables. Further details on the underlying predictor series are provided in Appendix 1. At each sample step using data through time t, we recompute factors using principal components analysis on the macroeconomic data. We then estimate a VAR model using these factors along with the relevant outcome variable: auto sales or IP-IPRE. We set the lag order and compute forecasts in the same way as the benchmark VAR.
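The factor extraction step can be illustrated directly via the singular value decomposition of the standardized panel (equivalent to principal components; the simulated two-factor panel below is a stand-in for the 181 macroeconomic predictors):

```python
import numpy as np

def diffusion_indexes(X, k=2):
    """First k principal components of a T x N panel of standardized
    predictors, computed via SVD; these proxy the Stock-Watson factors."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each series
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] * S[:k]                    # T x k factor estimates

rng = np.random.default_rng(3)
latent = rng.normal(size=(240, 2))             # two common latent factors
X = latent @ rng.normal(size=(2, 181)) + rng.normal(size=(240, 181))
F = diffusion_indexes(X, k=2)
```

The estimated factors `F` would then enter the VAR alongside the outcome variable, with lag order and forecasts handled as in the benchmark VAR.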

2.2.4 h-step ahead autoregressive forecast (AR-Proj)

As an alternative to iteration-based forecasts, we could generate multistep forecasts directly by projecting future outcomes at time \(t + h\) onto data available at time \(t\). As a benchmark for this approach, we first consider a model that only uses lags of the variable to be forecasted to predict outcomes at time \(t + h\). The general forecasting equation is:

$$y_{t + h|t}^{h} = \alpha_{h} + \mathop \sum \limits_{j = 1}^{\rho } \gamma_{hj} y_{t - j + 1} + \varepsilon_{t + h}^{h}$$
(1)

For each sample step using data through time \(t\), we fit a forecasting model for horizons \(h \in \left[ {1,12} \right]\) with the lag order of the model \(\rho \in \left[ {0,6} \right]\) determined by BIC, where \(\rho = 0\) indicates that \(y_{t + h|t}^{h}\) is being projected onto a constant term only.
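Equation (1) can be sketched as an OLS projection (a minimal illustration with a hypothetical helper name; lag-order selection by BIC is omitted here and shown in the next subsection):

```python
import numpy as np

def ar_proj_forecast(y, h, rho):
    """Direct h-step forecast in the spirit of Eq. (1): regress y_{t+h} on a
    constant and (y_t, ..., y_{t-rho+1}), then evaluate the fitted equation
    at the latest observations. rho = 0 projects onto the constant alone."""
    T = len(y)
    rows, targets = [], []
    for t in range(max(rho - 1, 0), T - h):
        rows.append([1.0] + [y[t - j] for j in range(rho)])
        targets.append(y[t + h])
    coef, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
    x_latest = np.array([1.0] + [y[T - 1 - j] for j in range(rho)])
    return float(x_latest @ coef)
```

On a deterministic linear trend the projection is exact at any horizon, which makes a convenient sanity check for the indexing.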

2.2.5 h-step ahead autoregressive forecast with diffusion indexes (AR-DI-Proj)

Another model specification we consider is an extension of the AR-Proj model that uses current and lagged values of \(y_{t}\), along with current and lagged values of the diffusion indexes, to predict \(y_{t + h}\). The forecasting equation is extended to reflect the diffusion indexes:

$$y_{t + h|t}^{h} = \alpha_{h} + \mathop \sum \limits_{j = 1}^{m} \beta_{hj}^{\prime} F_{t - j + 1} + \mathop \sum \limits_{j = 1}^{\rho } \gamma_{hj} y_{t - j + 1} + \varepsilon_{t + h}^{h}$$
(2)

where \(F_{t}\) is the vector of factors whose estimation was described in the VAR-DI model section. At each step using data through time \(t\), we recompute factors using principal components analysis on the extended dataset of observables. We then fit a forecasting model for horizons \(h \in \left[ {1,12} \right]\) with the lag orders of the model \(\rho \in \left[ {0,6} \right]\) and \(m \in \left[ {1,3} \right]\) determined by BIC, where \(\rho = 0\) indicates \(y_{t + h|t}^{h}\) is being projected onto \(F_{t}\) and its lags (if selected) only.
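The joint selection of \(\rho\) and \(m\) by BIC can be sketched as follows. This is an illustration under simplifying assumptions: the data are simulated stand-ins, and a common `start` index holds the estimation sample fixed across specifications so that BIC values are comparable.

```python
import numpy as np

def bic_ols(X, y):
    """BIC of an OLS fit (Gaussian likelihood), used to rank specifications."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return n * np.log(resid @ resid / n) + k * np.log(n)

def design(y, F, h, rho, m, start):
    """Rows [1, F_t, ..., F_{t-m+1}, y_t, ..., y_{t-rho+1}] with target y_{t+h}."""
    rows, tgt = [], []
    for t in range(start, len(y) - h):
        row = [1.0]
        for j in range(m):
            row.extend(F[t - j])
        row.extend(y[t - j] for j in range(rho))
        rows.append(row)
        tgt.append(y[t + h])
    return np.asarray(rows), np.asarray(tgt)

rng = np.random.default_rng(4)
F = rng.normal(size=(200, 2))             # estimated factors (stand-ins)
y = 0.5 * F[:, 0] + rng.normal(size=200)  # outcome loads on the first factor

best, best_bic = None, np.inf
for rho in range(0, 7):                   # rho in [0, 6]
    for m in range(1, 4):                 # m in [1, 3]
        X, tgt = design(y, F, h=6, rho=rho, m=m, start=5)
        b = bic_ols(X, tgt)
        if b < best_bic:
            best, best_bic = (rho, m), b
```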

2.3 Do high-frequency data improve forecasts?

As a final exercise, we conduct a simple test of whether incorporating high-frequency data would have improved forecast accuracy during the pandemic. To do this, we add an additional variable to the AR-Proj and AR-DI-Proj models that incorporates information from weekly high-frequency data. The regression equations take the form of:

$$y_{t + h|t}^{h} = \alpha_{h} + \mathop \sum \limits_{j = 1}^{m} \beta_{hj}^{\prime} F_{t - j + 1} + \mathop \sum \limits_{j = 1}^{\rho } \gamma_{hj} y_{t - j + 1} + \delta W_{t} + \varepsilon_{t + h}^{h}$$
(3)

where \(W_{t}\) represents the high-frequency data variable, and \(\beta_{hj} = 0\) for the version of the model that does not include the diffusion indexes.

The high-frequency variable we use is the Weekly Economic Index (WEI) published by the Federal Reserve Bank of New York (Lewis et al. 2020). The WEI represents the common component of ten different daily and weekly series, including same-store retail sales, unemployment insurance claims, a weekly staffing index, consumer surveys, steel production, electricity output, and other series. The series is scaled to four-quarter GDP growth units. Our regression uses the monthly average of weekly observations of the WEI, with data starting in January 2008, which shortens the available history of data relative to the other models under consideration. As a result, while incorporating high-frequency data might potentially help the forecaster identify upcoming turning points in the series, the shorter historical sample could also result in less precision when identifying the parameters of the statistical model and generating forecasts and prediction intervals.
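The monthly aggregation step is straightforward; as a sketch with a hypothetical weekly series standing in for the actual WEI (the weekly timestamps here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical weekly series standing in for the WEI; the regression uses
# the average of the weekly observations within each calendar month
rng = np.random.default_rng(5)
weeks = pd.date_range("2008-01-05", periods=260, freq="W-SAT")
wei = pd.Series(rng.normal(size=260), index=weeks)

wei_monthly = wei.resample("MS").mean()   # one observation per calendar month
```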

3 Results

We first provide a simple graphic example to illustrate our basic procedure. Figure 3 shows monthly light vehicle sales forecasts and corresponding 80% prediction intervals that are calculated using data through March 2020 for two of the models discussed above: the benchmark ARIMA model and the AR-Proj model. Actual vehicle sales are shown in the solid black line. The figure indicates that in the early stage of the pandemic, from April to June 2020, the benchmark ARIMA model was more accurate than the AR-Proj model in terms of predicting lower light vehicle sales, but it still underpredicted the extent of the sales decline. By the third quarter of 2020, actual light vehicle sales came much closer to the predictions of the AR-Proj model. The figure shows how forecast accuracy varies by model and over time. Throughout this section, we will discuss relative model performance using mean absolute prediction error (MAPE) as our metric of forecast accuracy.
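Given stored forecast/actual pairs, the accuracy metric by horizon can be computed as below. This sketch assumes the conventional percentage-error reading of the paper's "mean absolute prediction error"; the record format is the hypothetical one from the evaluation-loop sketch in Sect. 2.2.

```python
import numpy as np

def mape_by_horizon(records, horizons=range(1, 13)):
    """Mean absolute prediction error (percent of the actual value) by
    forecast horizon; `records` holds (origin, h, forecast, actual) tuples."""
    out = {}
    for h in horizons:
        errs = [abs(f - a) / abs(a) for (_, hh, f, a) in records if hh == h and a != 0]
        out[h] = 100.0 * float(np.mean(errs)) if errs else float("nan")
    return out

# two stored 1-step forecasts, each off by 10 percent of the actual value
demo = [(0, 1, 9.0, 10.0), (1, 1, 11.0, 10.0)]
```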

Fig. 3

Light Vehicle Sales and Selected Forecasts

Our first set of results is for the forecast period that starts in January 2019, which captures roughly 1 year of the pre-pandemic period and 2 years of the pandemic period. Results for forecasting monthly light vehicle sales are in Table 2, which shows the MAPE of each of the models under consideration, by forecast horizon. The results are also shown graphically in Fig. 4.

Table 2 Light Vehicle Sales Forecast Accuracy, Full Forecast Period
Fig. 4

Light Vehicle Sales Forecast Accuracy, Full Forecast Period

Figure 4 and the first five columns of Table 2 show that, among the five forecasting models under consideration, the most accurate, as reflected by a lower MAPE, are the AR-Proj and AR-DI-Proj models. In contrast, for the iteration-based forecasts (the ARIMA, VAR, and VAR-DI models), MAPE is higher and increases as the forecast horizon increases. Incorporating information about the broader economy via diffusion indexes into the model does not substantially improve forecast accuracy, with MAPE roughly similar between the AR-Proj and AR-DI-Proj models. Indeed, MAPE is slightly higher for AR-DI-Proj forecasts at horizons of 8 months and beyond compared to the AR-Proj forecast. Among the iteration-based forecasting models, the VAR-DI forecast has a higher MAPE at all horizons relative to the ARIMA and benchmark VAR models. This is in contrast to the findings of Stock and Watson (2002), who find substantial improvement in forecast accuracy of AR-DI-Proj relative to AR-Proj. Figure 4 and the last two columns of Table 2 also show that the models incorporating information from high-frequency economic data did not perform better than their counterparts which omitted the WEI.

How much did the pandemic matter in generating this performance gap across models? Breaking out forecast performance into pre- and post-pandemic periods reveals that forecasting approaches had vastly different levels of resilience to the COVID shock.

Table 3 shows relative forecast accuracy for forecasts of light vehicle sales from January 2019 through January 2020. For all the models under consideration, MAPE was an order of magnitude lower prior to the pandemic. In relative terms, the AR-DI-Proj model and ARIMA model performed the best, and in the pre-pandemic period we obtain the Stock and Watson (2002) result that including diffusion indexes improves the forecast accuracy of AR-DI-Proj relative to the AR-Proj model. Similar to results for the full forecast period, adding high-frequency data to the forecast models does not consistently improve their performance relative to their counterparts without high-frequency data. Though the AR-Proj model with high-frequency data did outperform its counterpart at the 6–11 month horizons, once information from the diffusion indexes is added to the model, the inclusion of high-frequency data does not improve forecasts (Fig. 5).

Table 3 Light Vehicle Sales Forecast Accuracy, pre-pandemic period
Fig. 5

Light Vehicle Sales Forecast Accuracy, pre-pandemic period

Table 4 and Fig. 6 show that forecast accuracy for all models significantly deteriorated in the pandemic period starting February 2020. AR-Proj and AR-DI-Proj had superior forecasting performance compared to the iteration-based forecast approach for forecast horizons longer than 3 months, but incorporating diffusion indexes did not significantly improve the forecast. These findings suggest forecasts based on multistep-ahead projection could be more resilient when the economy is experiencing significant turmoil. The last two columns of Table 4 indicate that incorporating high-frequency data did not make forecasts more accurate despite the increased popularity of such data during the pandemic.

Table 4 Light Vehicle Sales Forecast Accuracy, pandemic period
Fig. 6

Light Vehicle Sales Forecast Accuracy, pandemic period

3.1 External validity check: industrial production of information processing and related equipment

Are these findings unique to the auto sector? To test the generalizability of our results, we replicate this exercise for the information technology sector, generating forecasts of the industrial production index for information processing and related equipment (IP-IPRE). At first glance, this category, which includes computers, photocopiers, and scientific and medical equipment, may appear to be just as susceptible to the pandemic semiconductor shortage as the auto sector. However, as noted in a 2021 White House report on supply chain resilience, during the pandemic “semiconductor suppliers shifted production and foundry orders away from automotive-grade chips where demand was falling to business and consumer electronics chips where demand was spiking,” which allowed IP-IPRE to be more resilient to COVID-related disruption relative to auto production (United States 2021).

Indeed, as shown in Fig. 2, industrial production in the sector was more resilient to the COVID shock compared to auto sales, with a rise in remote work driving demand for the sector’s goods. Additionally, the subsequent recovery in the sector was similar to the recovery for overall GDP.

Figure 7 and the first five columns of Table 5 show forecast accuracy for forecasting IP-IPRE. Compared to our forecasts for light vehicle sales, forecast MAPE across all five models was smaller. However, in line with our findings for the auto sector, the AR-Proj and AR-DI-Proj had superior forecast performance overall. Also similar to the results for the auto sector, for the full forecast period the addition of diffusion indexes did not improve the forecast performance of the AR-DI-Proj and VAR-DI models relative to the respective version omitting diffusion indexes. The last two columns of Table 5 indicate that including high-frequency data did not improve forecast accuracy of the AR-Proj and AR-DI-Proj models.

Fig. 7

IP-IPRE forecast accuracy, full forecast period

Table 5 IP-IPRE Forecast Accuracy, full forecast period

Figures 8 and 9 show forecast accuracy for the pre-pandemic and pandemic periods, respectively. Similar to the findings in the main section, forecast accuracy fell during the pandemic period, with projection-based forecast approaches faring better during the pandemic. AR-Proj and AR-DI-Proj performed similarly to each other in both the pre-pandemic and pandemic periods. This is in contrast to the worse performance for AR-DI-Proj for light vehicle sales in the pandemic period, as auto sales were buffeted by sector-specific shocks that made the auto recovery track differently from the broader economy during the pandemic. Figures 8 and 9 also show that incorporating high-frequency data did not consistently improve the forecast performance of the AR-Proj and AR-DI-Proj models, in line with our findings from the auto sector.

Fig. 8

IP-IPRE Forecast Accuracy, pre-pandemic period

Fig. 9

IP-IPRE Forecast Accuracy, pandemic period

4 Conclusion

What lessons can we take away from this exercise? A few come to mind:

  1.

    For the auto industry and goods manufacturing in the information sector, projection-based forecasting strategies were more resilient than iteration-based forecasts in the aftermath of the COVID shock. However, model performance varied between the pre-pandemic and pandemic periods, and results may vary when forecasting other economic variables. This highlights that there is no substitute for the economist’s judgment in the forecasting process, particularly when it comes to model selection and specification. Ho (2021) lays out best practices for incorporating subjective judgment in forecasting, including clearly communicating assumptions and imposing assumptions probabilistically.

  2.

    For models incorporating diffusion indexes, which isolate underlying factors that are “common” across a wide range of indicators, forecast accuracy may drop in industries that are being driven by idiosyncratic forces. This was the case for auto sales, where the post-recession recovery followed a very different trajectory from that of broad GDP. Knowing whether the industry one is trying to forecast has been affected by significant idiosyncrasies could be helpful in choosing which variables to include in a model, and which model’s predictions to put more weight on.

  3.

    High-frequency indicators of economic activity gained a lot of attention during COVID, but we show that simply incorporating information from high-frequency data into standard forecasting tools does not necessarily improve forecast accuracy. This does not mean such indicators are useless: these data may have been invaluable in helping craft a convincing narrative around a specific forecast (which can be half the battle for a business economist), as well as providing empirical support for forecasts calling for turning points and inflection points in the economy. However, more research is needed to understand the best way to leverage these data in econometric models.

Finally, while in this paper we focused on assessing the forecast performance of a business economist’s basic toolkit, other models which may currently be underutilized in the private sector could provide promising avenues for further investigation. These include dynamic factor models using mixed frequency data (Kim and Yoo 1995; Camacho et al. 2014), models with time-varying coefficients (Cogley and Sargent 2005; Del Negro and Primiceri 2015), and factor models with time-varying factor loadings (Wei and Zhang 2020). Further research on the performance of these methods in an applied context could help make these approaches a valuable addition to the business forecaster’s toolkit.