1 Introduction

Energy resources are highly valuable to every economy (Zhu et al., 2014). Numerous energy researchers, such as Hamilton (2003) and Charles and Darné (2017), have suggested that fluctuations in crude oil prices may cause significant shocks to the overall economy and equity markets. For example, Jo (2014) and Wen et al. (2019) suggested that crude oil price volatility has the greatest capacity to create economic uncertainty and leads to increased risk in the energy market in the short term. In addition, the emergence of this risk triggers supplementary energy costs that damage national budgets. Thus, market actors should decrease their investments in risky assets such as crude oil commodities.

Contemporary economists worldwide agree on the need to tackle the risk of oil market fluctuations. In response, Charles and Darné (2017) indicated that a better understanding of crude oil prices and their underlying determinants is the essential first step in dealing with this risk. This enables both investors and energy policymakers to establish successful hedging policies and monitor the effects of regulatory measures on the oil market.

In line with this, Ftiti et al. (2020) and Wu et al. (2021) added that an effective assessment of crude oil prices improves traders’ confidence and motivates them to increase their investments in risky assets and, hence, to achieve optimal portfolio management. This, in turn, helps to achieve equilibrium in global financial markets in the short and long run. Another approach followed by many energy researchers is to forecast crude oil prices and their fluctuations. This allows traders and policymakers to better determine the value of crude oil prices, to speculate effectively, and to control the demand and supply of crude oil at lower costs.

Thus far, numerous studies have developed several tools to forecast crude oil prices. These tools can be subdivided into several categories. The first class focuses on traditional methods, such as autoregressive integrated moving average models and generalized autoregressive conditional heteroscedasticity models (Alquist et al., 2013; Baumeister & Kilian, 2012; Salisu & Fasanya, 2013; Wang & Wu, 2012). This type of tool is able to capture a linear structure in the relationship between variables. The second class is reserved for artificial intelligence methods such as artificial neural networks (Wang & Wang, 2016). These intelligent models appear robust to stylized facts in the data such as nonlinearity. The third class comprises heterogeneous autoregressive (HAR) models. Many studies (Chen et al., 2020; Liu et al., 2018) have shown that HAR models are effective when the data used to forecast crude oil prices are of the high-frequency type. The last class uses hybrid tools to forecast crude oil prices (Ftiti et al., 2020; Zhang et al., 2015). These tools result from the combination of two or more methods. For example, Ftiti et al. (2020) proposed a hybrid method composed of two parts: a linear structure (autoregressive model) and a nonlinear structure (polynomial, saturation, and dead zone forms). The authors affirm that hybrid tools are effective in considering the complex characteristics of the connectedness between crude oil prices and their forecasters.
Based on the aforementioned tool classifications, different determinants affecting crude oil prices have been identified, for instance, supply and demand factors (Conrad et al., 2014); political factors (Wu et al., 2021); investor sentiment (Yang et al., 2019); large shocks such as the global financial crisis (Charles & Darné, 2014); and uncertainty indexes such as the implied volatility index (VIX), economic policy uncertainty (EPU) (Dutta, 2017; Dutta et al., 2021), the Geopolitical Risk index (Nonejad, 2021), and the Equity Market Volatility (EMV) trackers (Dutta et al., 2021).

Extant energy research primarily examines the effects of uncertainty on crude oil prices. This is because of the emergence of various measures of uncertainty related to financial market conditions (CBOE Volatility Index (VIX)), energy market conditions (CBOE Crude Oil Volatility Index (OVX)), and EPU. As mentioned above, researchers have shown that variation in these measures of uncertainty can significantly affect crude oil prices. In addition, traders and policymakers have observed that this uncertainty risk should be considered more dangerous when it is accompanied by a financial crisis, as in October 1987 and December 2008, or by a pandemic outbreak, such as the COVID-19 pandemic. Furthermore, these conditions lead to a complex association between crude oil prices and uncertainty measures.

Other studies have demonstrated that commodity assets such as gas prices play a leading role in explaining and forecasting crude oil prices (Alquist et al., 2013; Cerqueti et al., 2019; Gatfaoui, 2016). In particular, Chai et al. (2018) showed that the crude oil market is specified by the complexity caused by nonlinearity, regime-switching, and the effects of variations in many explanatory variables. Ftiti et al. (2020) justified that under normal conditions, the complex association between oil prices and factors such as the gas price can be generated by the demand and supply channels. They suggested that a rising crude oil price motivates demanders to substitute natural gas for oil, thereby leading to a higher demand for gas and, consequently, a price increase. Nevertheless, an upward movement in the price of crude oil due to growing oil demand could lead to an increased price of natural gas in three ways: an increase in gas production, an increase in the production costs of natural gas, and in projects that contribute to the effective utilization of natural gas. The authors also showed that the connectedness between crude oil and gas becomes more complex under abnormal conditions. They cited many events such as the 2001 terrorist attacks, the financial crisis of 2007–2009, the Arab Spring, geopolitical factors (Iran–United States), and the Libyan War. These events play a primordial role in generating an increasingly complex association between gas and oil. In line with this, the change in pricing methods and the emergence of other assets in financial and energy markets, accompanied by an investor’s heterogeneity, can also lead to asymmetry and non-linearity; hence, it can represent another source of a complex link between crude oil and gas. In summary, previous studies have attempted to solve the complexity issue in the energy market by providing various forecasting methods and factors. 
Nevertheless, prediction tools need to be improved to address the disequilibrium caused by uncertainty and complexity.

This research aims to address the gap in the extant literature by considering the forecasting power of uncertainty indices together with the gas price for the crude oil price. To this end, we used a dataset of the West Texas Intermediate Spot Price (WTI), the United States Economic Policy Uncertainty index (EPU), the Chicago Board Options Exchange Volatility Index (VIX), the Chicago Board Options Exchange Crude Oil Volatility Index (OVX), and Henry Hub Natural Gas Spot Price (GP) from May 10, 2007, to August 09, 2021. In addition, we applied a novel artificial intelligence method called the eXtreme Gradient Boosting (XGBoost) method on a daily dataset. Chen and Guestrin (2016) are the developers of this tool. They indicated that the XGBoost method has a higher ability in terms of learning speed and performance compared to other machine learning (ML) tools. It has become more robust to provide good and accurate forecasts and to reduce complexity, especially when data normalization is applied (Ben Jabeur et al., 2021a, 2021b). The notable motivations behind the use of XGBoost soft-computing tools to forecast crude oil prices are as follows. On the one hand, we refer to Ftiti et al. (2020), who indicated that the crude oil price is sensitive to various factors such as the information advent, behavior of participants in the market, and uncertainty in the financial and energy markets. Therefore, the authors also showed that these events were considered the main sources of complex characteristics (e.g., structural evolution, nonlinear structure, time-varying, and non-stationarity) in crude oil prices. Li et al. (2016) noted that the most important challenge facing researchers today is to apply effective tools to achieve good predictive accuracy for crude oil price series, considering the aforementioned complex characteristics. 
In our study, detecting these stylized facts motivates us to use computational and intelligent models such as the XGBoost approach for forecasting crude oil prices. On the other hand, the XGBoost tool is characterized by lower complexity and costs, higher accuracy, and rapid processing times. This allows various inputs to be combined to improve forecasting accuracy, and the SHapley Additive exPlanations (SHAP) method to be applied to detect the importance of specific features for crude oil prices.

This study contributes to the literature in several ways. First, to the best of our knowledge, no studies have examined the combined effects of uncertainty indices and gas prices on crude oil price forecasts; this study is the first to do so. Second, the complex pricing of crude oil motivates us to use the XGBoost method, which offers lower complexity and costs, higher accuracy, and rapid processing times. This study is the first to use the XGBoost method to forecast the price of WTI crude oil. In addition, to verify the superiority of this novel ML method, we compare it against the autoregressive integrated moving average model with exogenous inputs (ARIMAX (p,d,q)) and the Support Vector Machine (SVM). The performance of the competing models is assessed using two indicators: root mean square error (RMSE) and mean absolute error (MAE). Third, our estimated results offer important implications for policymakers and investors for understanding the link between investments in oil futures contracts and investors’ fear, as well as the effect of this association on the stability of the oil market. This also allows them to discover other forecasters related to uncertainty in crude oil prices using sophisticated ML tools.

Overall, the empirical findings indicate that the proposed XGBoost model outperforms its counterparts (e.g., SVM model and ARIMAX) in accurately forecasting crude oil prices. In addition, the results show that implied volatility dominates gas prices and other uncertainty variables when forecasting crude oil prices. Accordingly, this study’s findings can be used by policymakers to achieve effective policy regulation regarding the stability of government spending and economic growth. In addition, traders can profit from accurate forecasting of crude oil prices to realize efficient portfolio diversification and correct hedging against huge fluctuations in oil prices.

The remainder of this paper proceeds as follows. Section 2 presents the literature review. Section 3 presents the data analysis and methodological frameworks. Section 4 discusses the main findings. Section 5 summarizes the main concluding remarks and policy implications.

2 Literature Review

Over the last three decades, both academicians and policymakers have been interested in the relationship between macroeconomic performance and oil price shocks. According to Sreenu (2018), oil prices are considered an important driver of economic development and GDP growth. Prior research has reported that fluctuations in the oil market are a key determinant of recessions (Hamilton, 1983, 2003). Other studies have concluded that the relationship between oil market shocks and economic performance could be causal (Chen & Chen, 2007; Wang, 2013).

Considering the narrow association between oil price fluctuations and the economic performance of nations, forecasting crude oil prices is crucial. An increasing number of studies are focusing on this topic. This study focuses on the role of gas prices and uncertainty as key factors that affect oil pricing volatility.

According to economic theory, crude oil and natural gas prices are reciprocal: the two commodities are considered substitutes for each other in consumption or production. Many empirical studies have shown that the connection between the two commodities is significant (Ftiti et al., 2020; Gatfaoui, 2016; Lu et al., 2014). More specifically, some authors (e.g., Brown & Yücel, 2008) found that the relationship between the two variables is linear. However, many researchers have confirmed a nonlinear relationship. For example, Batten et al. (2017) examined the spillover of time-varying prices between natural gas and crude oil markets from 1994 to 2014. The empirical results reveal that the price of natural gas spills over to the price of crude oil, with price spillover effects persisting for up to two weeks. Furthermore, the findings indicate a lower price dependency between the two energy commodities after 2006.

Ftiti et al. (2020) used a hybrid model composed of two parts: a linear model represented by an ARX structure and a nonlinear structure represented by many forms, such as dead zone, saturation, and polynomial structures. The empirical estimates of the Hammerstein-ARX model showed the significant forecasting power of gas prices for crude oil prices.

Fewer studies have explored the nonlinear link between crude oil and gas prices. For example, Atil et al. (2014) used a nonlinear autoregressive distributed lag (NARDL) model to study the association between oil and gas prices. The empirical findings indicate that there is an asymmetric and nonlinear effect between oil, gasoline, and natural gas prices. However, there are differences in the transmission mechanisms.

Regarding the economic uncertainty-oil price relationship, Baker et al. (2016) and Basu and Bundick (2017) concluded that economic uncertainty is an important factor behind the fluctuations and stability of the macroeconomy. In addition to the macroeconomic dimension, economic uncertainty significantly impacts the financial market (Pastor & Veronesi, 2012) and crude oil market (Van Robays, 2016). Similarly, Yi et al. (2021) tested whether macroeconomic uncertainty could explain and forecast China’s INE crude oil futures market volatility. The empirical findings of the GARCH-MIDAS model show that geopolitical risk and economic policy uncertainty from the UK and Japan could be considered stronger predictors of crude oil future volatility in China.

Aloui et al. (2016) used a copula approach to explore the impact of economic uncertainty on crude oil returns. They found that higher economic policy uncertainty indices positively affect crude oil returns only during certain periods. This positive effect explains the positive dependence before the financial crisis and Great Recession. Using both the WTI and Brent crude oil prices and performing structural equation modeling, Wang and Sun (2017) studied the determinants of oil price changes. The empirical results show that economic activity is the most significant factor that affects oil pricing volatility. The findings also indicate that wars and political tensions explain oil price fluctuations in the largest oil-producing countries. To investigate the effect of crude oil price uncertainty on investment, Phan et al. (2019) used a dataset of more than 33,000 firms located in 54 countries over the period 1984–2015. The authors found that crude oil price uncertainty negatively influences corporate investment. This negative effect differs between crude oil producers and crude oil consumers. The authors also found that this negative effect depends on the firm’s characteristics and stock market development. Based on WTI and Brent crude oil price data, Yang (2019) explored the correlation between EPU and oil price shocks in the U.S. The empirical findings of the structural vector autoregression framework support a strong connection between crude oil prices and EPU. The author pointed out that the total association of the Brent crude oil price is more pronounced than that of the WTI crude oil price.

Recently, Lyu et al. (2021) studied the time-varying impact of global economic policy uncertainty shocks on oil price fluctuations. Two crude oil pricing benchmarks were used: Brent and West Texas Intermediate crude oil prices. The empirical results indicate a time variation in the impact of economic uncertainty on crude oil price fluctuations. This impact is amplified under extreme market conditions, such as the 2008 international financial crisis and the European sovereign debt crisis of 2010–2012. More recently, Dutta et al. (2021) used a new variable called EMV trackers, which measures uncertainty in the financial market. Using quantile regressions, the authors showed that the new EMV tracker is a good forecaster of crude oil volatility. Wang et al. (2021) used geopolitical risk as a proxy for uncertainty. The estimates of the Markov regime-switching GARCH model showed that geopolitical risk has significant forecasting power for crude oil volatility. Referring to the aforementioned literature, we note that researchers are focusing on the predictive ability of uncertainty factors for oil prices. In addition, an in-depth review of previous works shows that the trend has shifted toward methods that can predict oil prices accurately and account for the complex characteristics of oil prices, such as structural evolution, nonlinear structure, time variation, and non-stationarity. Table 1 reports several studies that used advanced ML models to predict crude oil prices. For example, Ben Jabeur et al. (2021a, 2021b) attempted to predict oil prices during the COVID-19 pandemic. To this end, they used advanced ML methods such as LightGBM, CatBoost, XGBoost, random forest (RF), and neural network models. Their study is also based on SHapley Additive exPlanations (SHAP) values. The main findings support the superiority of RF and LightGBM over traditional models.
The empirical results also indicate that high values of GER and ESG lead to lower crude oil prices. Li et al. (2021) integrated variational mode decomposition (VMD) and random sparse Bayesian learning (RSBL; SBL-based prediction with random lags and random samples) to forecast crude oil prices. The findings indicate that the proposed VMD-RSBL significantly outperforms several state-of-the-art schemes. Li et al. (2021) introduced a novel multiscale hybrid model for crude oil price forecasting. The variational mode decomposition method is advantageous because it allows the decomposition of the crude oil price into several simple modes. The empirical findings confirm that the proposed model can achieve superior forecasting results. Table 1 also considers other studies that use hybrid tools to forecast crude oil prices. Abdollahi (2020) employed a hybrid model that included complete ensemble empirical mode decomposition, support vector machine, particle swarm optimization, and Markov-switching generalized autoregressive conditional heteroskedasticity. The author demonstrates the dominance of the proposed hybrid models over their counterparts in accurately forecasting crude oil prices. Similar results are found in Ahmad et al. (2021), who documented that the novel hybrid model consisting of median ensemble empirical mode decomposition and the group method of data handling is superior to statistical and ML models, including empirical mode decomposition, artificial neural networks, and ARIMA models.

Table 1 Contemporary studies on crude oil price prediction

3 Data and Methodology

3.1 Data Analysis

The data sample contains daily values of the West Texas Intermediate Spot Price (WTI), the United States Economic Policy Uncertainty index (EPU), the Chicago Board Options Exchange Volatility Index (VIX), the Chicago Board Options Exchange Crude Oil Volatility Index (OVX), and the Henry Hub Natural Gas Spot Price (GP). Crude oil and gas prices were downloaded from the Energy Information Administration website. The EPU index is based on the number of newspaper articles that discuss financial regulation, the economy, monetary and trade policies, and uncertainty in the United States. This variable was downloaded from the EPU website. The VIX and OVX data were obtained from the Chicago Board Options Exchange (CBOE) website. Whaley (1993) was the first to develop the VIX to measure uncertainty in the financial market. It is also called “the fear sentiment.” Subsequently, the Chicago Board Options Exchange developed the OVX to represent uncertainty in the oil market. The sample covers the period from May 10, 2007, to August 09, 2021. The choice of the period is justified by the availability and continuity of the data. Thus, we have 3584 observations. The data are separated into two parts. The first part is devoted to model training, with 2868 observations. The second part is used for model validation, with 716 observations.
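The training/validation partition described above can be sketched as a chronological split, in which the earliest observations train the model and the most recent ones validate it. This is a minimal illustration only: the function name `chronological_split` and the placeholder data are assumptions, and the source does not state that the split was implemented this way.

```python
def chronological_split(series, n_train):
    """Split a time-ordered sequence into training and validation parts.

    No shuffling is applied, so every validation observation is later
    in time than every training observation.
    """
    if not 0 < n_train < len(series):
        raise ValueError("n_train must lie strictly between 0 and len(series)")
    return series[:n_train], series[n_train:]


# Placeholder stand-in for the 3584 daily observations (assumption: the real
# series would be loaded from the data sources named in the text).
observations = list(range(3584))

train, valid = chronological_split(observations, n_train=2868)
print(len(train), len(valid))
```

Keeping the split chronological avoids leaking future information into the training sample, which matters for any forecasting exercise.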

Table 2 reports descriptive statistics for all series, including the uncertainty indexes and gas prices. All series have non-normal distributions, as deduced from the Jarque–Bera statistics, which are significant at the 1% level. Furthermore, the skewness statistics differ from 0 for all variables, which corroborates the asymmetrical distributions of the series. Additionally, all series have heavier tails than a normal distribution, as shown by kurtosis values that are much larger than 3. Fig. 1 presents the correlation matrix between variables. The findings indicate that the oil price (LOP) has a negative and weak correlation with all uncertainty indices (LOVX, LVIX, and LEPU). However, the oil price is positively correlated with the gas price (0.60).
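For readers who want to reproduce this kind of normality check, the sketch below computes skewness, kurtosis, and the Jarque–Bera statistic from their textbook definitions. The function name and the toy sample are assumptions for illustration; the values in Table 2 come from the authors' own data and software.

```python
import math


def jarque_bera(values):
    """Return (skewness, kurtosis, JB statistic, asymptotic p-value)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3, and 4 (population versions).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2          # equals 3 for a normal distribution
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality JB is asymptotically chi-square with 2 degrees of
    # freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return skewness, kurtosis, jb, p_value


# Toy right-skewed sample (hypothetical numbers, not the paper's data).
sample = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 5.0, 1.0, 0.9]
print(jarque_bera(sample))
```

A skewness far from 0 or a kurtosis far from 3 inflates the statistic and drives the p-value toward zero, which is the pattern the text reports for all series.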

Table 2 Descriptive statistics
Fig. 1 Correlation matrix

Moreover, LOVX is highly correlated with LVIX (0.713). We therefore calculate the variance inflation factor (VIF) to check for multicollinearity; the results are reported in Table 3. All VIF values are below the common thresholds of 5 and 10, indicating no multicollinearity in the dataset.

Table 3 Variance inflation factor (VIF)
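As a rough sense of scale, the VIF of regressor j is defined from the coefficient of determination obtained by regressing it on the remaining regressors. The worked example below makes a simplifying assumption that is not the specification behind Table 3: it treats LOVX and LVIX as the only two regressors, so that the coefficient of determination equals the squared correlation of 0.713 reported in the text.

$$\mathrm{VIF}_{j}=\frac{1}{1-R_{j}^{2}},\qquad \mathrm{VIF}_{\mathrm{LOVX}}\approx \frac{1}{1-{0.713}^{2}}\approx \frac{1}{1-0.508}\approx 2.03$$

Even for this most strongly correlated pair, the illustrative value stays well below the threshold of 5.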

For the unit-root analysis, Fig. 2 plots the autocorrelation function (ACF) of each series. We observe that all series exhibit unit-root behavior: the ACF decays hyperbolically and remains far from zero. Tissaoui (2019) indicates that the existence of a unit-root structure in a series corroborates the long-memory pattern in financial time series. Consequently, differencing all series is important for obtaining stationary data before modeling and forecasting. The results after differencing are illustrated in Fig. 3. We see that the ACF of every series declines rapidly and stays near zero.

Fig. 2 Autocorrelation functions of the time series. a Autocorrelation of WTI. b Autocorrelation of OVX. c Autocorrelation of VIX. d Autocorrelation of EPU. e Autocorrelation of gas

Fig. 3 Autocorrelation functions of the first-differenced time series. a Autocorrelation of WTI. b Autocorrelation of OVX. c Autocorrelation of VIX. d Autocorrelation of EPU. e Autocorrelation of gas
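The contrast between Figs. 2 and 3 can be illustrated with a short sketch that computes the sample ACF of a series in levels and after first differencing. A simulated random walk stands in for the price series here; this is an assumption made purely for illustration, and the helper names are hypothetical.

```python
import random


def difference(series, d=1):
    """Apply the first-difference operator d times."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


def acf(series, max_lag):
    """Sample autocorrelation function for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    return [
        sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


random.seed(0)
# Simulated unit-root process (random walk), not the paper's data.
walk = [0.0]
for _ in range(1999):
    walk.append(walk[-1] + random.gauss(0.0, 1.0))

acf_level = acf(walk, 20)             # decays very slowly, stays far from zero
acf_diff = acf(difference(walk), 20)  # drops to near zero after lag 0
print(round(acf_level[1], 3), round(acf_diff[1], 3))
```

A slowly decaying ACF in levels together with a rapidly vanishing ACF after differencing is the pattern the text uses to justify working with differenced series.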

3.2 Methodology

ML involves the ability of computers to learn from a particular set of data and then apply it to another set of data. In this regard, Hao et al. (2020) further added that the ML tool has the potential to capture the hidden non-linear pattern and non-stationarity in the crude oil price series (Li et al. 2016, 2018). Li et al. (2021) reported that ML methods are currently the most popular time-series forecasting methods in the literature. One of the reasons for their attractiveness is that these techniques are more appropriate for continuous variables such as crude oil price time series. From this perspective, this study investigated the potential of several ML-based forecasting algorithms for crude oil prices. Three competitor approaches are used to attain this goal: the XGBoost method, SVM, and ARIMAX (p,d,q).

3.2.1 ARIMAX(p,d,q)

In this study, ARIMAX (p,d,q) is considered to be the reference model. The use of this model is motivated by its ability to consider the long-memory behavior shown in the gas price and uncertainty indexes (see Fig. 3). The ARIMAX (p,d,q) model is expressed as:

$${\Delta }^{d}{\mathrm{WTI}}_{t}=\updelta +\sum_{i=1}^{p}{\varphi }_{i}{WTI}_{t-i}+\sum_{i=1}^{k}{\omega }_{i}{GP}_{t-i}+\sum_{i=1}^{k}{\alpha }_{i}{OVX}_{t-i}+\sum_{i=1}^{k}{\beta }_{i}{VIX}_{t-i}+\sum_{i=1}^{k}{\gamma }_{i}{EPU}_{t-i}+\sum_{i=0}^{q}{\sigma }_{i}{\varepsilon }_{t-i}$$
(1)

where \({\mathrm{WTI}}_{t}\) is the crude oil price; \({WTI}_{t-i}\) are the previous values of the WTI; and \({GP}_{t-i}\), \({OVX}_{t-i}\), \({VIX}_{t-i}\), and \({EPU}_{t-i}\) are the previous values of the gas price, OVX, VIX, and EPU, respectively. δ is a constant, and d denotes the order of integration. \({\omega }_{i}\), \({\varphi }_{i}\), \({\alpha }_{i}, {\beta }_{i}, {\gamma }_{i}\), and \({\sigma }_{i}\) represent the coefficients, and k, p, and q are the maximum time lags of the forecasters’ sequences, output sequence, and residuals, respectively. The identification of ARIMAX (p,d,q) is achieved using the Box–Jenkins method. Subsequently, the Akaike information criterion (AIC) is applied to select the best model from several ARIMAX (p,d,q) models.

3.2.2 Support Vector Machine (SVM)

SVM is a supervised learning method developed by Vapnik (1997) that applies the structural risk minimization (SRM) principle and can be used for both classification and regression. Consider training data \({\{{x}_{i }, {y}_{i}\}}_{i=1}^{n}\), where \({x}_{i }\in {R}^{L}\) is a vector of L input features, \({y}_{i }\in R\) is the output target, and n is the total number of data patterns. The aim of SVM regression is to find a function f(x) whose predictions deviate from the desired output \({y}_{i}\) by less than the insensitive loss parameter \((\varepsilon )\) for all the training data and that, at the same time, is as flat as possible (Smola & Schölkopf, 2004). The linear regression function in the low-dimensional space is mathematically described as follows:

$$f\left(x\right)=\left\langle \omega ,x\right\rangle +b$$
(2)

where \(\omega \) is the weight vector that is normal to the hyperplane and b is the hyperplane bias.

The regression problem is transformed into an optimization problem as follows:

$$ \begin{aligned} & {\text{minimize}}\;\frac{1}{2}\left\| \omega \right\|^{2} + C\sum_{i = 1}^{n} \left( \xi_{i} + \xi_{i}^{*} \right) \\ & {\text{subject to}}\;\left\{ {\begin{array}{*{20}c} {y_{i} - \left\langle {\omega ,x_{i} } \right\rangle - b \le \varepsilon + \xi_{i} } \\ {\left\langle {\omega ,x_{i} } \right\rangle + b - y_{i} \le \varepsilon + \xi_{i}^{*} } \\ {\xi_{i} ,\xi_{i}^{*} \ge 0} \\ \end{array} } \right. \\ \end{aligned} $$
(3)

where \({\xi }_{i},{\xi }_{i}^{*} \in R\) are the slack variables and C is the penalty coefficient. The Lagrange multiplier is introduced to solve the optimization problem and the regression function takes the following form:

$$f(x)=\sum_{i=1}^{n} \left({\alpha }_{i}-{\alpha }_{i}^{\prime}\right)k\left({x}_{i},x\right)+b$$
(4)

where \({\alpha }_{i},{\alpha }_{i}^{\prime}\) are the Lagrange multipliers and \(k\left({x}_{i},x\right)\) is the kernel function. The Karush–Kuhn–Tucker (KKT) conditions are used to compute b (Kuhn & Tucker, 1951; Smola & Schölkopf, 2004).
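The kernel expansion in Eq. (4) and the ε-insensitive loss behind Eq. (3) can be sketched as follows. Everything numeric here is hypothetical: the support vectors, multipliers, bias, and the choice of a Gaussian (RBF) kernel are assumptions for illustration and are not taken from the fitted model in this study.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel k(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, alpha, alpha_prime, b, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i (alpha_i - alpha'_i) * k(x_i, x) + b."""
    return sum(
        (a - a_p) * kernel(sv, x)
        for sv, a, a_p in zip(support_vectors, alpha, alpha_prime)
    ) + b


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger ones cost linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical dual solution with two support vectors.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
alpha = [0.8, 0.0]
alpha_prime = [0.0, 0.3]
bias = 0.1

prediction = svr_predict([0.0, 0.0], support_vectors, alpha, alpha_prime, bias)
print(prediction, epsilon_insensitive_loss(1.0, prediction))
```

In practice the multipliers and bias come from solving the dual of the optimization problem, for example with an off-the-shelf SVM library; the sketch only shows how a solved model produces a prediction.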

3.2.3 eXtreme Gradient Boosting (XGBoost) Method

XGBoost is an ML technique developed by Chen and Guestrin (2016) that can be used for regression and classification problems. This method has been adopted in different domains, such as healthcare (Singh et al., 2019) and the metal market (Ben Jabeur et al., 2021a, 2021b).

Based on the gradient direction of the loss function, XGBoost generates a weak learner at each step and adds it to the overall model. The objective function is regularized to prevent overfitting and to make the learning process faster. The model output function is given by the following:

$${\widehat{y}}_{i}^{T}=\sum_{k=1}^{T}{f}_{k}\left({x}_{i}\right)={\widehat{y}}_{i}^{T-1}+{f}_{T}\left({x}_{i}\right)$$
(5)

where \({\widehat{y}}_{i}^{T-1}\) represents the prediction of the previously generated trees, \({f}_{T}\left({x}_{i}\right)\) represents the newly created tree model, and T represents the total number of tree models. In addition, Ma et al. (2020) added that XGBoost is robust in terms of modeling nonlinear associations between variables and has strong classification ability. Accordingly, many researchers have indicated that ML is a powerful technique for forecasting time-series data. However, it does not provide interpretable inferences as traditional econometrics does. To improve the interpretability of ML models such as XGBoost, Lundberg and Lee (2017) proposed the Shapley additive explanation (SHAP) method, which interprets the predictions of ML techniques based on the game theory advanced by Shapley (1953). The SHAP approach allows us to explain the prediction for a specific input (X) by calculating the impact of each feature on the prediction. The key idea of SHAP is to calculate the Shapley values for each feature of the sample to be interpreted, where each Shapley value represents the impact of its associated feature on the prediction. Moreover, ML models usually have a large number of features, each of which is a discrete or continuous variable, which makes the exact calculation of Shapley values for each instance and each feature computationally very demanding. The Shapley values are therefore estimated by sampling, and the SHAP method is well suited to the problem studied here.
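The additive update in Eq. (5), in which each new tree is fitted along the gradient of the loss and added to the running prediction, can be sketched with one-split regression stumps and squared-error loss. This is plain gradient boosting written for illustration: it omits XGBoost's regularization term and second-order approximation, the learning rate is an added assumption that does not appear in Eq. (5), and the data are a toy example.

```python
def fit_stump(xs, residuals):
    """Fit the best single-split regression stump on a 1-D feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Build the additive model: prediction_T = prediction_{T-1} + new tree."""
    trees = []
    preds = [0.0] * len(xs)
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: sum(learning_rate * t(x) for t in trees)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy nonlinear target (hypothetical data).
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
mean_y = sum(ys) / len(ys)

model_5 = boost(xs, ys, n_trees=5)
model_50 = boost(xs, ys, n_trees=50)
print(mse(lambda x: mean_y, xs, ys), mse(model_5, xs, ys), mse(model_50, xs, ys))
```

Each round can only lower the training error here, which is also why boosting needs regularization or a validation sample to guard against overfitting.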

The estimated Shapley value is calculated as follows:

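$${\widehat{\phi }}_{j}=\frac{1}{K}\sum_{k=1}^{K}\left(\widehat{g}\left({x}_{+j}^{k}\right)-\widehat{g}\left({x}_{-j}^{k}\right)\right)$$
(6)

where \(\widehat{g}\left({x}_{+j}^{k}\right)\) is the prediction for x, but with a random number of feature values replaced by the values of a randomly drawn data point, except for the value of feature j; \({x}_{-j}^{k}\) is identical to \({x}_{+j}^{k}\) except that the value of feature j is also taken from the randomly drawn data point; and K is the number of sampling iterations.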
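The sampling estimator in Eq. (6) can be sketched as below. The function `shapley_estimate`, the toy linear model, and the background points are hypothetical and serve only to illustrate the sampling scheme; the SHAP library used in practice relies on faster, model-specific algorithms.

```python
import random


def shapley_estimate(predict, x, background, j, n_samples=200, seed=0):
    """Monte Carlo estimate of the Shapley value of feature j for instance x."""
    rng = random.Random(seed)
    n_features = len(x)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice(background)        # random data point supplying replacements
        order = list(range(n_features))
        rng.shuffle(order)                # random feature ordering
        pos = order.index(j)
        keep_from_x = set(order[:pos])    # features ahead of j keep the values of x
        # x_plus keeps x's value for feature j; x_minus takes it from z.
        x_plus = [x[i] if (i in keep_from_x or i == j) else z[i]
                  for i in range(n_features)]
        x_minus = [x[i] if i in keep_from_x else z[i]
                   for i in range(n_features)]
        total += predict(x_plus) - predict(x_minus)
    return total / n_samples


def toy_model(features):
    # Hypothetical linear stand-in for a fitted model g-hat.
    return 2.0 * features[0] - 1.0 * features[1] + 0.5 * features[2]


x = [1.0, 2.0, 3.0]
background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
phi = [shapley_estimate(toy_model, x, background, j) for j in range(3)]
print(phi)
```

For a linear model each sampled difference reduces to the coefficient times the gap between the instance and the drawn data point, which makes the toy case easy to check by hand.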

3.2.4 Performance Metrics

To assess the prediction performance of the different models, we use two criteria: the RMSE and MAE, which are computed as follows:

$${\text{MAE}}=\frac{100}{{N}_{v}}\sum_{t=1}^{{N}_{v}}\frac{\left|{\widehat{y}}_{t}-{y}_{t}\right|}{\overline{y}}$$
(7)
$${\text{RMSE}}=\sqrt{\frac{1}{{N}_{v}}\sum_{t=1}^{{N}_{v}}{\left({y}_{t}-{\widehat{y}}_{t}\right)}^{2}}$$
(8)

where \({\widehat{y}}_{t}\) is the predicted crude oil price, \({y}_{t}\) is the actual crude oil price at time t, \(\overline{y}\) represents the mean crude oil price, and \({N}_{v}\) denotes the number of observations used in the validation phase of forecasting.
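Both criteria are straightforward to compute. The sketch below follows Eqs. (7) and (8) as written, so the MAE is expressed as a percentage of the mean actual price; the function names and the three-point example are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error, Eq. (8)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae_percent_of_mean(actual, predicted):
    """Mean absolute error scaled by the mean actual value, Eq. (7)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    return 100.0 / n * sum(abs(p - a) for a, p in zip(actual, predicted)) / mean_actual


# Hypothetical validation values, not taken from Table 4.
actual = [50.0, 60.0, 70.0]
predicted = [52.0, 58.0, 71.0]
print(rmse(actual, predicted), mae_percent_of_mean(actual, predicted))
```

Because the RMSE squares the errors, it penalizes large misses more heavily than the MAE, so reporting both gives a fuller picture of forecast accuracy.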

4 Empirical Results

4.1 Performance Analysis

In this section, we present the empirical findings generated by the aforementioned models, which are applied to examine the simultaneous impact of gas prices and uncertainty indices on the WTI crude oil price. Figure 4 shows the predicted and actual series of crude oil prices, on the basis of which we compare the linear ARIMAX (p,d,q) model against both the linear SVM and nonlinear XGBoost on a validation sample (20% of the sample). As shown in Fig. 4, for both ML forecasting tools (i.e., the linear SVM and the nonlinear XGBoost), the curves of the predicted values behave almost identically to the curve of the actual values of the WTI crude oil price. However, Fig. 4 shows that the curve for the ARIMAX (p,d,q) model has a very different pattern from the curve of the actual crude oil price. This implies that the ML forecasting methods appear robust in predicting crude oil prices. In addition, it is difficult to determine the best-fit model among the competing ML tools from Fig. 4 alone. We therefore refer to the performance metrics (RMSE and MAE), which are reported in Table 4. The forecasting tool with the lowest RMSE and MAE values is selected as the best-fit model. As shown in Table 4, the XGBoost model is dominant in predicting the WTI crude oil price compared with the linear SVM and ARIMAX models. Empirical evidence shows that the values of the performance metrics produced by the XGBoost model (RMSE = 0.0581; MAE = 0.0392) are the lowest among the three models. Overall, the ML tool, as a complex model, outperforms the linear model in forecasting crude oil prices with good accuracy.

Fig. 4 Plot of crude oil price forecasts

Table 4 Prediction assessment of the candidate models
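The selection rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the split is assumed to be chronological, the helper names are hypothetical, and only the XGBoost scores are the values reported in the text. The SVM and ARIMAX scores are placeholders and are not the entries of Table 4.

```python
def chronological_split(series, validation_share=0.2):
    """Hold out the last `validation_share` of observations (assumed time-ordered)."""
    cut = int(round(len(series) * (1 - validation_share)))
    return series[:cut], series[cut:]


def best_model(scores):
    """Return the model that is lowest on both RMSE and MAE, or None if they disagree."""
    by_rmse = min(scores, key=lambda name: scores[name]["rmse"])
    by_mae = min(scores, key=lambda name: scores[name]["mae"])
    return by_rmse if by_rmse == by_mae else None


# Toy series of ten observations: eight for training, two for validation.
train, valid = chronological_split(list(range(10)))

scores = {
    "XGBoost": {"rmse": 0.0581, "mae": 0.0392},  # reported in the text
    "SVM": {"rmse": 0.0700, "mae": 0.0500},      # placeholder, NOT from Table 4
    "ARIMAX": {"rmse": 0.3000, "mae": 0.2500},   # placeholder, NOT from Table 4
}
selected = best_model(scores)
print(selected)
```

Returning `None` when the two metrics disagree makes explicit that the rule "lowest RMSE and MAE" is only decisive when one model wins on both.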

4.2 Feature Analysis

In this section, we focus only on the features of the two machine-learning forecasting tools (i.e., the linear SVM and the nonlinear XGBoost). The ARIMAX model is excluded from the feature analysis because it produced the weakest results. The Shapley additive explanation (SHAP) method is used to explain the effect of the gas price and the uncertainty indices on the WTI crude oil price. Ben Jabeur et al. (2021a, 2021b) inferred that the SHAP method can be used by policymakers and investors to understand ML results, which are characterized by their complexity. Before discussing the feature analysis, we first examine the convergence of the residuals for both the linear SVM and the nonlinear XGBoost. To do so, we use the DALEX R package proposed by Biecek and Burzykowski (2021) to explain the XGBoost and SVM models. The reverse cumulative distribution of the absolute residuals in Fig. 5 indicates that there are fewer residuals in the left tail of the XGBoost residual distribution than in that of the SVM model. The results show that the XGBoost model is more efficient than the SVM model in terms of convergence. These findings support the results of Climent et al. (2019) and Ben Jabeur et al. (2021a, 2021b), who showed the superiority of the XGBoost model over traditional models in gold price forecasting and credit scoring. In addition, our results are in line with Herrera et al. (2019), Wu et al. (2020), and Ben Jabeur et al. (2021a, 2021b), who revealed that machine-learning tools outperform traditional models in forecasting crude oil prices.

Fig. 5 Residual convergence
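To make the residual comparison concrete, the following sketch shows one way a reverse cumulative distribution of absolute residuals could be computed: for each threshold, it returns the share of observations whose absolute residual is at least that large. The function name and the residual values are hypothetical, and the code is not the DALEX implementation.

```python
def reverse_cumulative(residuals, thresholds):
    """Share of observations with |residual| >= t, for each threshold t."""
    abs_res = [abs(r) for r in residuals]
    n = len(abs_res)
    return [sum(a >= t for a in abs_res) / n for t in thresholds]


# Hypothetical residuals of two competing models, for illustration only.
residuals_a = [0.01, -0.02, 0.03, -0.01]
residuals_b = [0.05, -0.10, 0.02, -0.08]
thresholds = [0.0, 0.02, 0.05]

curve_a = reverse_cumulative(residuals_a, thresholds)
curve_b = reverse_cumulative(residuals_b, thresholds)
print(curve_a)
print(curve_b)
```

A curve that falls toward zero more quickly indicates that fewer observations carry large absolute residuals, so in this toy example model A would be preferred.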

Figures 6 and 7 display the SHAP values for the SVM and XGBoost models, respectively. The features are sorted by the sum of the magnitudes of their SHAP values over all samples, and the SHAP values show the distribution of the impact of each feature on the model output. The color denotes the value of the feature (red for high, blue for low).

Fig. 6 SVM model: feature importance

Fig. 7 XGBoost model: feature importance
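The ordering used in these plots can be illustrated with a small sketch. For a linear model whose features are treated as independent, the SHAP value of feature \(j\) for one observation reduces to the coefficient of that feature multiplied by the deviation of the feature from its sample mean. The coefficients, the data, and the label LGAS for the gas price are hypothetical and are chosen for illustration only.

```python
def linear_shap(rows, weights):
    """SHAP values of a linear model under feature independence: w_j * (x_j - mean_j)."""
    n = len(rows)
    means = {f: sum(r[f] for r in rows) / n for f in weights}
    return [{f: weights[f] * (r[f] - means[f]) for f in weights} for r in rows]


def rank_features(shap_rows):
    """Sort features by the sum of absolute SHAP values over all samples."""
    totals = {f: sum(abs(s[f]) for s in shap_rows) for f in shap_rows[0]}
    return sorted(totals, key=totals.get, reverse=True), totals


# Hypothetical coefficients and observations (not estimated from the study's data).
weights = {"LOVX": -0.5, "LGAS": 0.3, "LVIX": 0.1, "LEPU": 0.05}
rows = [
    {"LOVX": 3.0, "LGAS": 1.0, "LVIX": 2.5, "LEPU": 4.5},
    {"LOVX": 3.5, "LGAS": 1.2, "LVIX": 3.0, "LEPU": 5.0},
    {"LOVX": 4.0, "LGAS": 1.4, "LVIX": 3.5, "LEPU": 5.5},
]

shap_rows = linear_shap(rows, weights)
ranking, totals = rank_features(shap_rows)
print(ranking)
```

In this toy example, the observation with the highest LOVX value receives a negative SHAP value for LOVX, which is how a negative effect of a high (red) feature value would appear on the left side of the abscissa.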

The variables are ordered by their importance in forecasting the WTI crude oil price, and each row displays one feature. A redder point indicates a higher feature value, and a bluer point indicates a lower feature value. The SHAP values are plotted on the abscissa. A positive SHAP value reflects a positive effect of the input on the output, whereas a negative SHAP value indicates a negative effect of the input on the output. As observed in Fig. 6, LOVX is the most important feature according to the SVM model: it has the greatest impact on the model and pushes the forecast of the WTI crude oil price downward. The gas price appears in second position, and gas and oil prices are positively related. This implies that the uncertainty caused by investors’ fear in the oil market is a good predictor of the WTI crude oil price, followed by the gas price. Moreover, the results show that LVIX has weaker feature importance in forecasting the WTI crude oil price, with a positive impact on crude oil. This means that the uncertainty generated by investors’ fear in the financial market has lower forecasting power for crude oil prices. In addition, the SVM model shows that LEPU is a less important feature. Therefore, we infer that the uncertainty caused by US economic policy is a limited source of crude oil fluctuations. Similar results are obtained using the XGBoost model. Figure 7 shows that LOVX is the most important variable and that it negatively impacts oil prices: a higher LOVX value is associated with a lower forecast of the WTI crude oil price. The gas price is the second most important feature, which indicates that gas prices are an important source of information for crude oil forecasting.
In contrast to the SVM model, the XGBoost model indicates that the forecasting power of LVIX is improved, although it remains less important than LOVX and the gas price. LEPU also gains in feature importance relative to the SVM model. In addition, Figs. 8 and 9 rank the features by RMSE to show the impact of each feature on the model output. In particular, the width of the interval bands corresponds to variable importance, while the bars indicate the RMSE loss after permutations for the SVM and XGBoost models. The XGBoost model (Fig. 9) has a lower RMSE than the SVM model (Fig. 8). Because, as shown previously, the XGBoost model dominates the SVM model in terms of accuracy, performance, and convergence, we rely on the feature importance generated by the XGBoost model in this study. Thus, we infer that LOVX has a higher ability to predict the WTI crude oil price than the gas price.

Fig. 8 SVM model: RMSE loss after permutations

Fig. 9 XGBoost model: RMSE loss after permutations
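The idea behind the RMSE loss after permutations can be sketched as follows: the values of one feature are shuffled across observations, the model predicts again, and the resulting RMSE is averaged over several shuffles. The fitted model, the feature label LGAS, and the data below are hypothetical, and the code is an illustration rather than the DALEX routine.

```python
import math
import random


def rmse(actual, predicted):
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


def predict(row):
    # Hypothetical fitted model used only for illustration; it ignores LEPU.
    return 5.0 - 0.5 * row["LOVX"] + 0.3 * row["LGAS"]


def permutation_rmse(rows, target, feature, repeats=5, seed=0):
    """Average RMSE after shuffling one feature column `repeats` times."""
    rng = random.Random(seed)
    losses = []
    for _ in range(repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
        losses.append(rmse(target, [predict(r) for r in permuted]))
    return sum(losses) / repeats


# Toy data: the target is the noise-free model output, so the baseline RMSE is zero.
rows = [{"LOVX": 3.0 + 0.2 * i, "LGAS": 1.0 + 0.1 * i, "LEPU": 5.0 - 0.3 * i}
        for i in range(8)]
target = [predict(r) for r in rows]

baseline = rmse(target, [predict(r) for r in rows])
loss = {f: permutation_rmse(rows, target, f) for f in ("LOVX", "LGAS", "LEPU")}
print(baseline, loss)
```

A feature that the model does not use leaves the RMSE at its baseline after shuffling, whereas shuffling an influential feature raises the RMSE. The larger the increase, the more the model depends on that feature.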

5 Conclusion and Implications

This study examines the forecasting power of gas prices and uncertainty indices for crude oil prices. We compare two ML models against a linear model to determine which is most effective in forecasting the crude oil price, using a dataset from May 10, 2007, to August 9, 2021. In particular, we consider the SVM, XGBoost, and ARIMAX (p,d,q) models to examine the simultaneous effects of the uncertainty indices and the gas price on crude oil price forecasting.

This study yields three main results. First, given the complex relationship between crude oil and its forecasters, the findings reveal the dominance of ML models, such as the SVM and XGBoost models, over traditional models: the performance metrics of the ML models are better than those of the ARIMAX model. Second, after eliminating ARIMAX from the analysis, the XGBoost model appears to be superior to the SVM model in terms of accuracy and convergence. Third, the feature importance analysis conducted with the Shapley additive explanation method shows that the different uncertainty indices and gas prices display a significant ability to forecast future WTI crude oil prices. In addition, the SHAP values highlight that the informational content of LOVX dominates that of the other uncertainty indices and the gas price in forecasting WTI crude oil. The results of this study have important policy implications for both investors and policymakers.

First, investors’ fear in the oil market (represented by OVX) is shown to be the dominant forecaster of WTI oil prices among the gas price and the other uncertainty indices. This result shows that increased fear among investors can be a factor in oil price fluctuations. Therefore, traders should attach great importance to the main source of this fear, which is investing in oil futures contracts. Many oil futures contracts that are bought and sold in the derivatives markets are not designed for investment but for other purposes, such as hedging. Thus, the oil contract is a dangerous investment because it is not an investment in the first place and is not guaranteed at the time of settlement and the date of delivery. Inexperienced traders should therefore not treat these contracts as investment tools, nor apply to them the investment rules used for other instruments, such as stocks. In the same context, investors should know that they are speculating in a narrow timeframe and that, before the settlement date, they must close out the contract unless they intend to take delivery of the oil and have a place to store it. This also indicates to policymakers that achieving a stable energy market depends mainly on governing and controlling trading in oil futures contracts in order to limit fear among investors and, consequently, to stabilize the oil price.

Second, the original approach, namely the XGBoost model combined with the Shapley additive explanation method (SHAP), has a higher capacity to accurately capture the complex structure of the relationship between crude oil and its forecasters. In other words, the XGBoost tool is able to predict crude oil prices even if the sample used includes periods of crisis (see Footnote 5). The good fit of this method should encourage financial and energy authorities to use these complex tools to predict the prices of crude oil and other risky assets.

Third, the ARIMAX model, as a linear model, does not succeed in accurately predicting crude oil prices. This limitation should discourage both investors and policymakers from relying on linear models to capture the association between the crude oil price and its forecasters, which is characterized by complexity.

Although this study offers important findings and policy implications for both investors and policymakers, it has some limitations. First, our study did not consider the COVID-19 pandemic as a variable that could affect the relationship between uncertainty indices and oil price fluctuations. Second, the research question was not examined separately for the periods before and during the COVID-19 pandemic. The empirical findings of this study can be improved by addressing these limitations. Future research could also explore new interpretable deep learning algorithms and uncertainty forecasters with greater predictive power.