
Prediction of hierarchical time series using structured regularization and its application to artificial neural networks

  • Tomokaze Shiratori,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Ibaraki, Japan

  • Ken Kobayashi,

    Roles Conceptualization, Methodology, Writing – review & editing

    ken-kobayashi@fujitsu.com

    Affiliation Artificial Intelligence Laboratory, Fujitsu Laboratories Ltd., Kawasaki, Kanagawa, Japan

  • Yuichi Takano

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Writing – original draft, Writing – review & editing

    Affiliation Faculty of Engineering, Information and Systems, University of Tsukuba, Tsukuba, Ibaraki, Japan

Abstract

This paper discusses the prediction of hierarchical time series, where each upper-level time series is calculated by summing appropriate lower-level time series. Forecasts for such hierarchical time series should be coherent, meaning that the forecast for an upper-level time series equals the sum of forecasts for corresponding lower-level time series. Previous methods for making coherent forecasts consist of two phases: first computing base (incoherent) forecasts and then reconciling those forecasts based on their inherent hierarchical structure. To improve time series predictions, we propose a structured regularization method for completing both phases simultaneously. The proposed method is based on a prediction model for bottom-level time series and uses a structured regularization term to incorporate upper-level forecasts into the prediction model. We also develop a backpropagation algorithm specialized for applying our method to artificial neural networks for time series prediction. Experimental results using synthetic and real-world datasets demonstrate that our method is comparable in terms of prediction accuracy and computational efficiency to other methods for time series prediction.

Introduction

Multivariate time series data often have a hierarchical (tree) structure in which each upper-level time series is calculated by summing appropriate lower-level time series. For instance, numbers of tourists are usually counted on a regional basis, such as sites, cities, regions, or countries [1]. Similarly, many companies require regionally aggregated forecasts to support resource allocation decisions [2]. Product demand is often analyzed by category to reduce the overall forecasting burden [3].

Forecasts for such hierarchical time series should be coherent, meaning that the forecast for an upper-level time series equals the sum of forecasts for corresponding lower-level time series [4, 5]. Smoothing methods such as the moving average and exponential smoothing are widely used in academia and industry for time series predictions [6, 7]. Although these methods provide coherent forecasts for hierarchical time series, they have low accuracy, especially for rapidly changing time series.

Another common approach for making coherent forecasts is the use of bottom-up and top-down methods [3, 8–10]. These methods first develop base forecasts by separately predicting each time series and then reconcile those base forecasts based on their inherent hierarchical structure. The bottom-up method calculates base forecasts for bottom-level time series and then aggregates them for upper-level time series. In contrast, the top-down method calculates base forecasts only for a root (total) time series and then disaggregates them according to historical proportions of lower-level time series. Park and Nassar [11] considered a hierarchical Bayesian dynamic proportions model for the top-down method to disaggregate upper-level forecasts sequentially. The middle-out method calculates base forecasts for intermediate-level time series and then applies the bottom-up and top-down methods to make upper- and lower-level forecasts. However, the bottom-up method often accumulates prediction errors as the time series level rises, and the top-down method cannot exploit detailed information about lower-level time series. Notably, when base forecasts are unbiased, only the bottom-up method gives unbiased forecasts [12].

Hyndman et al. [12] proposed a linear regression approach that optimally combines base forecasts and then applies the bottom-up method. This forecast reconciliation method worked well for predicting tourism demand [1] and monthly inflation [13], and the approach can be extended to hierarchical and grouped time series [14]. van Erven and Cugliari [15] devised a game-theoretically optimal reconciliation method. Regularized regression models have also been employed to deal with high-dimensional time series [16, 17]. Wickramasuriya et al. [5] devised a sophisticated method for optimal forecast reconciliation through trace minimization. Their experimental results showed that this trace minimization method performed very well with synthetic and real-world datasets. Note, however, that all of these forecast reconciliation methods consist of two phases: first computing base forecasts and then reconciling those forecasts based on a hierarchical structure. This study aimed to produce better time series predictions by simultaneously completing these two phases.

Structured regularization uses inherent structural relations among explanatory variables to construct a statistical model [18–20]. Various regularization methods have been proposed for multivariate time series [21, 22], hierarchical explanatory variables [23–26], and artificial neural networks [27]. Prediction of multivariate time series is related to multitask learning, which shares useful information among related tasks to enhance the prediction performance for all tasks [28, 29]. Tailored regularization methods have been developed for multitask learning [30, 31] and applied to artificial neural networks [32]. To the best of our knowledge, however, no prior studies have applied structured regularization methods to predictions of hierarchical time series.

In this study, we aimed to develop a structured regularization method that takes full advantage of hierarchical structure for better time series predictions. Our method is based on a prediction model for bottom-level time series and uses a structured regularization term to incorporate upper-level forecasts into the prediction model. This study particularly focused on applying our method to artificial neural networks, which have been effectively used in time series prediction [33–38]. We developed a backpropagation algorithm specialized for our structured regularization model based on artificial neural networks. Experiments involving the application of our method to synthetic and real-world datasets demonstrated that our method was comparable in terms of prediction accuracy and computational efficiency to other methods that develop coherent forecasts for hierarchical time series.

Methods

This section starts with a brief review of forecasts for hierarchical time series. For such time series, we present our structured regularization model and its application to artificial neural networks. A backpropagation algorithm is also described for artificial neural networks with structured regularization.

Forecasts for hierarchical time series

We address the prediction of multivariate time series where each series is represented as a node in a hierarchical (tree) structure. Let y_{it} be an observation of node i ∈ N at time t ∈ T, where N is the set of nodes, and T is the set of time points. For simplicity, we focus on two-level hierarchical structures. Fig 1 shows an example of a two-level hierarchical structure with |N| = 7 nodes, where | ⋅ | denotes the number of set elements. The nodes are classified as N = {1} ∪ M ∪ B, where node 1 is the root (level-zero) node, and M and B are the sets of mid-level (level-one) and bottom-level (level-two) nodes, respectively; in Fig 1, M = {2, 3} and B = {4, 5, 6, 7}. The associated time series is characterized by the aggregation constraint

y_{kt} = Σ_{i∈B(k)} y_{it}  (k ∈ N∖B, t ∈ T), (1)

where B(k) ⊆ B denotes the set of bottom-level nodes that are descendants of node k (B(1) = B for the root node). Each upper-level time series is thus calculated by summing the corresponding lower-level time series.

A hierarchical structure is represented by the structure matrix H ≔ (h_{ki})_{(k,i)∈(N∖B)×B}, where h_{ki} = 1 if bottom-level node i is a descendant of upper-level node k, and h_{ki} = 0 otherwise. We define the summing matrix as S ≔ [ H ; I_{|B|} ], where I_n is the identity matrix of size n. In Fig 1, we have

H = [ 1 1 1 1 ; 1 1 0 0 ; 0 0 1 1 ],  S = [ 1 1 1 1 ; 1 1 0 0 ; 0 0 1 1 ; 1 0 0 0 ; 0 1 0 0 ; 0 0 1 0 ; 0 0 0 1 ],

where rows are separated by semicolons.
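
As a concrete illustration, the following R snippet (a minimal sketch; the node numbering follows Fig 1) builds H and S and checks the aggregation on an example observation vector.

```r
# Structure matrix H for the hierarchy of Fig 1:
# rows = upper-level nodes (1, 2, 3), columns = bottom-level nodes (4, 5, 6, 7).
H <- rbind(
  c(1, 1, 1, 1),  # node 1 (root) aggregates all bottom-level nodes
  c(1, 1, 0, 0),  # node 2 aggregates nodes 4 and 5
  c(0, 0, 1, 1)   # node 3 aggregates nodes 6 and 7
)

# Summing matrix S = [H; I_|B|] maps bottom-level values to all nodes.
S <- rbind(H, diag(4))

# Check the aggregation constraint y_t = S %*% y_t^B on an example vector.
y_bottom <- c(3, 1, 4, 2)
y_all <- S %*% y_bottom   # (10, 4, 6, 3, 1, 4, 2): root, mid-level, bottom-level
```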

Let y_t ≔ (y_{it})_{i∈N} be a column vector comprising observations of all nodes at time t ∈ T. Similarly, for a node subset A ⊆ N, we define y_t^A ≔ (y_{it})_{i∈A} as the observation vector of nodes i ∈ A at time t ∈ T. In Fig 1, we have y_t = (y_{1t}, y_{2t}, …, y_{7t})⊤, y_t^M = (y_{2t}, y_{3t})⊤, and y_t^B = (y_{4t}, y_{5t}, y_{6t}, y_{7t})⊤.

The aggregation constraint (1) is then expressed as

y_t^{N∖B} = H y_t^B  (t ∈ T), (2)

or, equivalently,

y_t = S y_t^B  (t ∈ T). (3)

Let ŷ_t ≔ (ŷ_{it})_{i∈N} be a column vector comprising base forecasts at time t ∈ T. Note that the base forecasts are calculated separately for each node i ∈ N, so they do not satisfy the aggregation constraint (2). For a node subset A ⊆ N, we define ŷ_t^A ≔ (ŷ_{it})_{i∈A} at time t ∈ T. Such base forecasts can be converted into coherent forecasts satisfying the aggregation constraint (2) by using the reconciliation matrix P ≔ (p_{ij})_{(i,j)∈B×N}. Specifically, we develop bottom-level forecasts P ŷ_t and use the aggregation constraint (3) to obtain coherent forecasts, as

ỹ_t = S P ŷ_t  (t ∈ T). (4)

A typical example of a reconciliation matrix is P = [ O_{|B|×|N∖B|}  I_{|B|} ], where O_{m×n} is an m × n zero matrix. This leads to the bottom-up method

ỹ_t = S ŷ_t^B  (t ∈ T). (5)

Another example is P = [ p  O_{|B|×(|N|−1)} ], where p = (p_i)_{i∈B} is a column vector comprising historical proportions of bottom-level time series. This results in the top-down method ỹ_t = S p ŷ_{1t} (t ∈ T).

In this manner, we can make coherent forecasts from various reconciliation matrices. The condition SPS = S is proven to ensure that when base forecasts are unbiased, the resultant coherent forecasts (4) are also unbiased [12]. Among the conventional bottom-up and top-down methods, only the bottom-up method fulfills this condition [12].
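
The two reconciliation matrices described above can be sketched in R as follows, reusing the matrices H and S from the previous snippet; the base forecasts and historical proportions are illustrative values only.

```r
nB <- 4; nU <- 3                       # bottom-level and upper-level node counts (Fig 1)

# Bottom-up reconciliation: P = [O_{|B| x |N\B|}, I_|B|], i.e., keep only bottom-level base forecasts.
P_bu <- cbind(matrix(0, nB, nU), diag(nB))

# Top-down reconciliation: P = [p, O_{|B| x (|N|-1)}], i.e., disaggregate the root base forecast
# by historical proportions p (illustrative values summing to one).
p <- c(0.4, 0.2, 0.3, 0.1)
P_td <- cbind(p, matrix(0, nB, nU + nB - 1))

y_hat <- c(10.5, 4.2, 6.1, 3.1, 1.2, 3.9, 2.2)  # base forecasts for nodes 1..7
y_tilde_bu <- S %*% P_bu %*% y_hat   # coherent forecasts, Eq (4), with the bottom-up P
y_tilde_td <- S %*% P_td %*% y_hat   # coherent forecasts, Eq (4), with the top-down P

# The unbiasedness condition SPS = S holds for the bottom-up matrix:
all.equal(S %*% P_bu %*% S, S)       # TRUE
```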

Forecast reconciliation methods

Hyndman et al. [12] introduced the following linear regression model for given base forecasts:

ŷ_t = S β_t + ε_t,

where β_t ≔ (β_{it})_{i∈B} is a column vector of bottom-level estimates, and ε_t ≔ (ε_{it})_{i∈N} is a column vector of errors having zero mean and covariance matrix var(ε_t) ≔ Σ_t. The bottom-up method (5), applied with the estimate of β_t in place of ŷ_t^B, is then used to make coherent forecasts.

If the base forecasts are unbiased and the covariance matrix Σt is known, the generalized least-squares estimation yields the minimum variance unbiased estimate of βt. However, the covariance matrix Σt is nonidentifiable and therefore impossible to estimate [5].

In contrast, Wickramasuriya et al. [5] focused on the differences between observations and coherent forecasts (4), y_t − ỹ_t = y_t − S P ŷ_t. The associated covariance matrix is derived as

var(y_t − S P ŷ_t) = S P W_t P⊤ S⊤, (6)

where W_t is the covariance matrix of the base forecast errors. The trace of the covariance matrix (6) is minimized subject to the unbiasedness condition SPS = S. This yields the optimal reconciliation matrix P = (S⊤ W_t⁻¹ S)⁻¹ S⊤ W_t⁻¹, and coherent forecasts (4) are given by

ỹ_t = S (S⊤ W_t⁻¹ S)⁻¹ S⊤ W_t⁻¹ ŷ_t  (t ∈ T). (7)

See Wickramasuriya et al. [5] for the full details.
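
A minimal R sketch of the trace-minimization reconciliation (7) is given below; it assumes that some estimate W of the base-forecast error covariance matrix is available (the shrinkage estimator used by MinT(Shrink) is one such choice) and reuses S and the illustrative base forecasts from the previous snippets.

```r
# Optimal reconciliation matrix P = (S' W^{-1} S)^{-1} S' W^{-1} and coherent forecasts, Eqs (4) and (7),
# given an estimate W of the covariance matrix of the base forecast errors.
mint_reconcile <- function(y_hat, S, W) {
  W_inv <- solve(W)
  P <- solve(t(S) %*% W_inv %*% S, t(S) %*% W_inv)
  S %*% P %*% y_hat
}

# Example with a diagonal covariance estimate (for illustration only).
W <- diag(c(2.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5))
y_tilde_mint <- mint_reconcile(y_hat, S, W)
```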

Note, however, that in these forecast reconciliation methods, base forecasts are first determined regardless of the underlying hierarchical structure, then those forecasts are corrected based on the hierarchical structure. In contrast, our proposal is a structured regularization model that directly computes high-quality forecasts based on the hierarchical structure.

Structured regularization model

We consider a prediction model for bottom-level time series. Its predictive value is denoted by the column vector f(x_t; Θ) ≔ (f_i(x_t; Θ))_{i∈B}, where x_t is a vector of explanatory variables (e.g., lagged observations) and Θ is a tuple of model parameters. As an example, the first-order vector autoregressive model is represented as f(y_{t−1}^B; Θ) = Θ y_{t−1}^B, where Θ = (θ_{ij})_{(i,j)∈B×B}.

The residual sum of squares for bottom-level time series is given by

Σ_{t∈T} ‖y_t^B − f(x_t; Θ)‖². (8)

We also introduce a structured regularization term that quantifies the error for upper-level time series based on the hierarchical structure. Let Λ ≔ Diag(λ) be a diagonal matrix of regularization parameters, where λ ≔ (λ_i)_{i∈N∖B} is a vector of its diagonal entries. Then, we construct a structured regularization term based on the aggregation constraint (2) as

Σ_{t∈T} ‖Λ(y_t^{N∖B} − H f(x_t; Θ))‖². (9)

Minimizing this term aids in correcting bottom-level forecasts, thus improving the upper-level forecasts.

Adding the regularization term (9) to the residual sum of squares (8) yields the objective function E(Θ) to be minimized. Consequently, our structured regularization model is posed as

minimize_Θ  E(Θ) ≔ Σ_{t∈T} { ‖y_t^B − f(x_t; Θ)‖² + ‖Λ(y_t^{N∖B} − H f(x_t; Θ))‖² }. (10)

Here, matrix Λ adjusts the tradeoff between minimizing the error term (8) for bottom-level time series and minimizing the error term (9) for upper-level time series. In the experiments section, we set its diagonal entries as

λ_i = λ_1 (i = 1),  λ_i = λ_M (i ∈ M), (11)

where λ_1 and λ_M are regularization parameters for the root and mid-level time series, respectively.
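
To make the objective function (10) concrete, the following R sketch evaluates E(Θ) for a generic bottom-level prediction function; the names predict_bottom, theta, X, Y_bottom, and Y_upper are ours (placeholders for the prediction function, its parameters, and the explanatory and observed data), not identifiers from the paper.

```r
# Objective function (10): squared bottom-level residuals, Eq (8), plus the structured
# regularization term for the upper-level series, Eq (9), weighted by Lambda = Diag(lambda).
structured_loss <- function(theta, predict_bottom, X, Y_bottom, Y_upper, H, lambda) {
  Lambda <- diag(lambda)             # lambda = (lambda_1, lambda_M, ...) as in Eq (11)
  total <- 0
  for (t in seq_len(nrow(X))) {
    f_t <- predict_bottom(X[t, ], theta)              # bottom-level forecasts
    r_bottom <- Y_bottom[t, ] - f_t                   # bottom-level residuals
    r_upper  <- Lambda %*% (Y_upper[t, ] - H %*% f_t) # weighted upper-level residuals
    total <- total + sum(r_bottom^2) + sum(r_upper^2)
  }
  total
}
```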

After solving the structured regularization model (10), we use the bottom-up method (5) to obtain coherent forecasts ỹ_t = S f(x_t; Θ). Our structured regularization model based on the bottom-up method may not work well when upper-level time series are easier to predict than bottom-level time series. To remedy this situation, we can adopt a methodology proposed by Panagiotelis et al. [39], in which the summing matrix is redefined by replacing a bottom-level time series with an upper-level time series.

Application to artificial neural networks

This study focused on the application of our structured regularization model (10) to artificial neural networks for time series prediction; see Bishop [40] and Goodfellow et al. [41] for general descriptions of artificial neural networks. For simplicity, we consider a two-layer neural network like the one shown in Fig 2, where the input vector z(1) comprises the explanatory variables x_t (e.g., lagged bottom-level observations).

First, we calculate the vector u(2) as the weighted sum of the input entries:

u(2) = W(2) z(1), (12)

where W(2) is a weight matrix to be estimated. This vector u(2) is transferred from the input units to the hidden units, as shown in Fig 2.

Next, we generate the vector z(2) by applying a nonlinear activation function a elementwise, z(2) = a(u(2)). Typical examples of activation functions include the logistic sigmoid function

a(u) = 1 / (1 + exp(−u)) (13)

and the rectified linear function a(u) = max{u, 0}. The vector z(2) is transferred from the hidden units to the output units, as shown in Fig 2.

Finally, we calculate the vector u(3) as the weighted sum of the outputs from the hidden units:

u(3) = W(3) z(2), (14)

where W(3) is a weight matrix to be estimated.

This process is summarized as

f(x_t; Θ) = u(3) = W(3) a(W(2) z(1)), (15)

where the tuple of model parameters is Θ ≔ (W(2), W(3)).

This neural network outputs u(3) = f(x_t; Θ) as the vector of predictive values.
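
Under these definitions, the forward computation (12)–(15) can be sketched in R as below; the helper names are ours, and the logistic sigmoid (13) is used as the default activation.

```r
sigmoid <- function(u) 1 / (1 + exp(-u))   # logistic sigmoid activation, Eq (13)

# Forward pass of the two-layer network, Eqs (12)-(15):
# u2 = W2 z1, z2 = a(u2), u3 = W3 z2 = f(x_t; Theta).
forward <- function(z1, W2, W3, activation = sigmoid) {
  u2 <- W2 %*% z1
  z2 <- activation(u2)
  u3 <- W3 %*% z2
  list(z1 = z1, u2 = u2, z2 = z2, u3 = u3)
}
```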

Backpropagation algorithm

We develop a backpropagation algorithm specialized for training artificial neural networks in our structured regularization model (10); see Bishop [40] and Goodfellow et al. [41] for overviews of backpropagation algorithms. Our algorithm sequentially minimizes the following error function for each time point t ∈ T:

E_t(Θ) ≔ ‖y_t^B − f(x_t; Θ)‖² + ‖Λ(y_t^{N∖B} − H f(x_t; Θ))‖². (16)

We first define the vectors δ(2) ≔ ∂E_t/∂u(2) and δ(3) ≔ ∂E_t/∂u(3), which consist of the partial derivatives of the error function (16) with respect to the intermediate variables (12) and (14). From Eqs (12) and (14), the partial derivatives of the error function (16) with respect to the weight matrices can be calculated as

∂E_t/∂W(2) = δ(2) (z(1))⊤, (17)

∂E_t/∂W(3) = δ(3) (z(2))⊤. (18)

From Eq (14), we have ∂u(3)/∂u(2) = W(3) Diag(a′(u(2))), where a′ denotes the derivative of the activation function. Therefore,

δ(2) = Diag(a′(u(2))) (W(3))⊤ δ(3). (19)

It follows from Eq (16) that

δ(3) = −2(y_t^B − f(x_t; Θ)) − 2 H⊤ Λ² (y_t^{N∖B} − H f(x_t; Θ)). (20)
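
A sketch of Eqs (17)–(20) for a single time point is given below; it reuses the forward helper above and assumes the logistic sigmoid activation (13), for which a′(u) = a(u)(1 − a(u)). The function and argument names are ours.

```r
# Backpropagated errors and weight gradients for one time point t, Eqs (17)-(20).
backward <- function(fwd, W3, y_bottom_t, y_upper_t, H, Lambda) {
  f_t <- fwd$u3                                            # network output f(x_t; Theta)
  # delta3 = dE_t/du3, Eq (20): gradient of the squared bottom-level error
  # plus the structured regularization term.
  delta3 <- -2 * (y_bottom_t - f_t) -
            2 * t(H) %*% (Lambda %*% Lambda) %*% (y_upper_t - H %*% f_t)
  # delta2 = dE_t/du2, Eq (19), with a'(u) = a(u)(1 - a(u)) for the logistic sigmoid.
  a_prime <- fwd$z2 * (1 - fwd$z2)
  delta2 <- a_prime * (t(W3) %*% delta3)
  list(grad_W2 = delta2 %*% t(fwd$z1),                     # Eq (17)
       grad_W3 = delta3 %*% t(fwd$z2))                     # Eq (18)
}
```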

Algorithm 1 summarizes our backpropagation algorithm.

Algorithm 1 Backpropagation algorithm.

Step 0 (Initialization): Let η > 0 be a step size and ε > 0 be a threshold for convergence. Set Θ = (W(2), W(3)) to initial parameter values, and set E ← E(Θ) = Σ_{t∈T} E_t(Θ) as the incumbent value of the objective function.

Step 1 (Backpropagation): Repeat the following steps for all tT:

Step 1.1: Compute z(1), u(2), z(2), and u(3) from Eq (15).

Step 1.2: Compute δ(3) from Eq (20) and then δ(2) from Eq (19).

Step 1.3: Compute the partial derivatives (17) and (18).

Step 2 (Gradient Descent): Update the weight parameter values as

W(ℓ) ← W(ℓ) − η Σ_{t∈T} ∂E_t/∂W(ℓ)  (ℓ = 2, 3).

Step 3 (Termination Condition): If E(Θ) > (1 − ε)E, terminate the algorithm with Θ = (W(2), W(3)). Otherwise, set E ← E(Θ) and return to Step 1.
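
Putting these steps together, a minimal R training loop in the spirit of Algorithm 1 might look as follows; it relies on the forward and backward helpers sketched earlier, uses full-batch gradient descent with a constant step size, and adds a maximum epoch count as a safeguard not present in the algorithm statement. All function and argument names are ours.

```r
# Algorithm 1 (sketch): backpropagation with full-batch gradient descent.
# Z1: matrix of input vectors (rows = time points); Y_bottom, Y_upper: observed series.
train_nn_sr <- function(Z1, Y_bottom, Y_upper, H, Lambda, n_hidden,
                        eta = 1e-5, eps = 5e-5, max_epoch = 10000) {
  n_in <- ncol(Z1); n_out <- ncol(Y_bottom)
  W2 <- matrix(rnorm(n_hidden * n_in), n_hidden, n_in)     # Step 0: standard normal initialization
  W3 <- matrix(rnorm(n_out * n_hidden), n_out, n_hidden)

  loss <- function(W2, W3) {                               # E(Theta) = sum of E_t(Theta), Eq (16)
    total <- 0
    for (t in seq_len(nrow(Z1))) {
      f_t <- forward(Z1[t, ], W2, W3)$u3
      total <- total + sum((Y_bottom[t, ] - f_t)^2) +
               sum((Lambda %*% (Y_upper[t, ] - H %*% f_t))^2)
    }
    total
  }

  E_best <- loss(W2, W3)
  for (epoch in seq_len(max_epoch)) {
    G2 <- matrix(0, n_hidden, n_in); G3 <- matrix(0, n_out, n_hidden)
    for (t in seq_len(nrow(Z1))) {                          # Step 1: backpropagation
      fwd <- forward(Z1[t, ], W2, W3)
      g <- backward(fwd, W3, Y_bottom[t, ], Y_upper[t, ], H, Lambda)
      G2 <- G2 + g$grad_W2; G3 <- G3 + g$grad_W3
    }
    W2 <- W2 - eta * G2; W3 <- W3 - eta * G3                # Step 2: gradient descent
    E_new <- loss(W2, W3)
    if (E_new > (1 - eps) * E_best) break                   # Step 3: termination condition
    E_best <- E_new
  }
  list(W2 = W2, W3 = W3)
}
```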

Experimental results and discussion

The experimental results reported in this section evaluate the effectiveness of our structured regularization model when applied to artificial neural networks. These experiments focused on the two-level hierarchical structure shown in Fig 3.

Performance evaluation methodology

To evaluate out-of-sample prediction performance, we considered training and test periods of the time series data, where the training period was used to train prediction models, and the test period was used to compute prediction errors of the trained models. We calculated the root-mean-squared error (RMSE) for each node i ∈ N over the test period as

RMSE_i ≔ √( (1/|T_test|) Σ_{t∈T_test} (y_{it} − ỹ_{it})² ),

where T_test ⊆ T denotes the set of time points in the test period.
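
A short R helper for this error measure (a sketch; actual and predicted hold the observed values and coherent forecasts of one node over the test period):

```r
# Root-mean-squared error for one node over the test period.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
```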

We compared the performance of the following methods for time series prediction.

  1. MA(n): moving average of the previous n values,
  2. ES(α): exponential smoothing with a smoothing parameter α ∈ [0, 1],
  3. NN+BU: bottom-up method (5) using artificial neural networks for base forecasts,
  4. NN+MinT: forecast reconciliation method (7) through trace minimization (i.e., MinT(Shrink) [5]) using artificial neural networks for base forecasts,
  5. NN+SR(λ1, λM): our structured regularization model (10) applied to artificial neural networks with regularization parameters λ1 and λM; see also Eq (11).

Here, we determined parameter values for n and α that minimized RMSE in the training period. During the training period, we tuned regularization parameters λ1 and λM through hold-out validation [42].
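
For reference, the two smoothing baselines MA(n) and ES(α) in the list above can be sketched as simple one-step-ahead forecasters (the function names are ours):

```r
# MA(n): forecast is the average of the previous n observations.
ma_forecast <- function(y, n) mean(tail(y, n))

# ES(alpha): simple exponential smoothing; the forecast is the last smoothed level.
es_forecast <- function(y, alpha) {
  level <- y[1]
  for (t in 2:length(y)) level <- alpha * y[t] + (1 - alpha) * level
  level
}
```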

We adopted two-layer artificial neural networks (Fig 4) for NN+BU, NN+MinT, and NN+SR. Here, prediction of each time series depends on its own two lags yi,t−1 and yi,t−2, and the backpropagation simultaneously updates weight parameters of all the series. Note also that NN+BU is equivalent to NN+SR(0,0). Following prior studies [43, 44], we set the number of hidden units to twice the number of input units (i.e., |D| = 4 ⋅ |B|). Bias parameters were added to hidden and output units.

Fig 4. Two-layer neural network adopted in experimental results.

We implemented the backpropagation algorithm (Algorithm 1) in the R programming language, with the convergence threshold set as ε = 5 × 10⁻⁵. The step size was kept constant at η = 1 × 10⁻⁵, which was small enough for the backpropagation algorithm to converge. We employed the logistic sigmoid function (13) as the activation function. The algorithm was repeated 30 times with initial values for the parameter Θ randomly generated from a standard normal distribution. The following sections show average RMSE values with 95% confidence intervals over the 30 trials.

Synthetic datasets

We generated common factors to express correlations among time series. Denote by N(μ, σ²) a normal distribution with mean μ and standard deviation σ. For the common factors, we used the first-order autoregressive models

ψ_{it} = ϕ_i ψ_{i,t−1} + ε_{it},  ε_{it} ∼ N(0, σ_i²),

where ϕ_i is an autoregressive coefficient, and σ_i is the standard deviation of the white noise for the ith common factor. Note that ψ_{it} reflects the overall trend for i = 1 and mid-level trends for i ∈ M = {2, 3, 4}.

Bottom-level time series were produced by combining the overall trend, mid-level trends, and autocorrelation. We denote the parent (mid-level) node of node i ∈ B by m(i) ∈ M. For bottom-level time series, we used first-order autoregressive models in which ρ_i and θ_i respectively indicate the effects of the common factors ψ_{1t} and ψ_{m(i),t} on the ith time series. After that, we generated upper-level time series (y_{it} for i ∈ N∖B) according to the aggregation constraint (2).

We prepared three synthetic datasets: NgtvC, WeakC, and PstvC. Table 1 lists the parameter values used to generate these datasets. Time series are negatively correlated in the NgtvC dataset, weakly correlated in the WeakC dataset, and positively correlated in the PstvC dataset. Each dataset consists of time series data at 100 time points; the first 70 and last 30 time points were used as the training and test periods, respectively.

We standardized the generated time series according to the mean and variance over the training period when using artificial neural networks. Specifically, we standardized each bottom-level time series for NN+BU and NN+SR and summed them appropriately to calculate upper-level time series for NN+SR. We standardized each time series at all levels for NN+MinT. We then computed predictive values for these time series and transformed the obtained predictive values into the original (unstandardized) scale. After that, we applied the bottom-up method (NN+BU and NN+SR) and the forecast reconciliation method (NN+MinT) to make coherent forecasts. Finally, we calculated RMSEs on the original scale to evaluate prediction performance.
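
The standardization procedure can be sketched as follows; Y_bottom is assumed to be a matrix whose rows are time points and whose columns are bottom-level series, train_idx indexes the training period (both names are ours), and H is the structure matrix defined earlier.

```r
# Standardize each bottom-level series using the training-period mean and standard deviation,
# then derive the upper-level series for NN+SR by summing the standardized bottom-level series.
mu    <- apply(Y_bottom[train_idx, ], 2, mean)
sigma <- apply(Y_bottom[train_idx, ], 2, sd)
Y_bottom_std <- sweep(sweep(Y_bottom, 2, mu, "-"), 2, sigma, "/")
Y_upper_std  <- Y_bottom_std %*% t(H)     # aggregation constraint (2) on the standardized scale

# Predictions are transformed back to the original scale before computing RMSEs.
unstandardize <- function(F_std) sweep(sweep(F_std, 2, sigma, "*"), 2, mu, "+")
```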

Results for synthetic datasets

Tables 2–4 list the out-of-sample RMSE values provided by each method for each node in the NgtvC, WeakC, and PstvC datasets. In the tables, the rows labeled “Mid-level” and “Bottom-level” show the average RMSE values over the mid- and bottom-level nodes, respectively, with the smallest RMSE value for each node indicated in bold. Note that average RMSE values with 95% confidence intervals are shown for NN+BU, NN+MinT, and NN+SR.

For the NgtvC dataset (Table 2), our structured regularization method NN+SR was comparable to the forecast reconciliation method and outperformed the other methods, except for the RMSE of the root node. For the WeakC dataset (Table 3), our method was slightly inferior to the exponential smoothing method, but the differences were minimal. For the PstvC dataset (Table 4), our method attained the best prediction performance in terms of average RMSE. These results show that our structured regularization method delivered good prediction performance for the three synthetic datasets. Our method was especially effective when the time series were strongly correlated, as in the NgtvC and PstvC datasets.

We next focus on the parameter values chosen for our structured regularization. Only for the PstvC dataset did our method NN+SR(λ1, λM) adopt λ1 > 0, and there it performed significantly better than the bottom-up method in terms of the RMSE of the root node. Additionally, our method employed λM > 0 for all three datasets and outperformed the bottom-up method in mid-level RMSEs. These results show an association between regularization weights and prediction accuracy at each time series level. Our method adjusts the regularization parameters to fit the data characteristics, thereby achieving better prediction performance.

Fig 5 shows the training RMSE values as a function of the epoch (number of iterations) in the backpropagation algorithm for the synthetic datasets. Note that the computational efficiency can be evaluated based on epochs because little difference existed between NN+SR and NN+BU in the computation time required for one epoch.

Fig 5. Convergence of the backpropagation algorithm for the synthetic datasets.

RMSEs decreased faster for our structured regularization method NN+SR than for the bottom-up method NN+BU. The convergence performance of the two methods greatly differed, especially for the PstvC dataset and upper-level time series. Consequently, our structured regularization method improved both prediction accuracy and convergence speed of the backpropagation algorithm. This suggests that our method will deliver good prediction performance even if the backpropagation algorithm is terminated in the middle of computation.

Fig 6 shows heat maps of the out-of-sample relative RMSE values provided by our structured regularization method NN+SR(λ1, λM) for the synthetic datasets. Here, the vertical and horizontal axes are the values of regularization parameters λ1 and λM, respectively. This figure shows how regularization for each time series level affects the prediction performance.

Fig 6. Heat maps of relative RMSEs provided by NN+SR(λ1, λM) in the synthetic datasets.

Note that RMSE values were normalized in Fig 6 such that the RMSE for (λ1, λM) = (0, 0) was zero (white-colored) in each trial. Accordingly, the corresponding regularization is effective if the relative RMSE value is negative (red-colored) and ineffective if it is positive (blue-colored). RMSEs were consistently reduced in the NgtvC dataset. Regularization was particularly effective for the root time series in the PstvC dataset. RMSE values tended to vary drastically from left to right in each heat map, which suggests that regularization for the mid-level time series greatly impacted prediction performance.

Real-world datasets

We downloaded historical data describing unemployment rates in Japan from e-Stat, a portal site for official Japanese statistics (https://www.e-stat.go.jp/en). Using these data, we prepared three real-world datasets for Japanese regions: Tohoku, Chubu, and Kansai. Table 5 lists the prefectures forming the two-level hierarchical structure (Fig 3).

We used quarterly statistics (model-based estimates) of unemployment rates during 90 time periods from January 1997 to June 2019, taking the first 60 and last 30 time periods as the training and test periods, respectively. As a preprocessing step, we removed seasonal and trend components by means of the stl function in the R stats package. We next calculated upper-level time series according to the aggregation constraint (2). After that, we standardized time series, computed predicted values, and calculated RMSEs on the original scale, in the same way as for the synthetic datasets.
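
A sketch of this preprocessing step for a single quarterly series y, using the stl function from the R stats package and keeping only the remainder component:

```r
# Remove seasonal and trend components from a quarterly series with stl()
# and keep the remainder component.
y_ts <- ts(y, frequency = 4, start = c(1997, 1))
decomp <- stl(y_ts, s.window = "periodic")
y_detrended <- as.numeric(decomp$time.series[, "remainder"])
```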

Results for real-world datasets

Tables 6–8 list the out-of-sample RMSE values provided by each method for each node in the Tohoku, Chubu, and Kansai datasets. For the Tohoku dataset (Table 6), our structured regularization method NN+SR was comparable to the forecast reconciliation method and substantially outperformed the other methods. For the Chubu dataset (Table 7), our method attained the second-best value for average RMSE.

For the Kansai dataset (Table 8), our method greatly exceeded the prediction performance of the other methods. These results demonstrate that our structured regularization method achieved good prediction performance for the three real-world datasets.

Fig 7 shows the training RMSE values as a function of epoch in the backpropagation algorithm for the real-world datasets. The convergence of RMSE values was consistently faster for our structured regularization method NN+SR than for the bottom-up method NN+BU. For the Tohoku and Chubu datasets, our method greatly accelerated convergence for upper-level time series. For the Kansai dataset, our method was superior to the bottom-up method in terms of both prediction accuracy and convergence speed. These results suggest that our structured regularization method improves the convergence performance of the backpropagation algorithm.

Fig 7. Convergence of the backpropagation algorithm for the real-world datasets.

Fig 8 shows heat maps of the out-of-sample relative RMSE values provided by our structured regularization method NN+SR(λ1, λM) for the real-world datasets. For the Tohoku dataset, the reduction in RMSE values was particularly large for the root time series. For the Chubu dataset, RMSE values changed greatly from left to right in each heat map, meaning that the regularization for mid-level time series was the most effective. For the Kansai dataset, RMSE values could be reduced greatly for all time series levels when the regularization parameters were properly tuned.

Fig 8. Heat maps of relative RMSEs provided by NN+SR(λ1, λM) in the real-world datasets.

Conclusion

We proposed a structured regularization model for predicting hierarchical time series. Our model uses the regularization term for improving upper-level forecasts to correct bottom-level forecasts. We demonstrated the application of our model to artificial neural networks for time series prediction. We also developed a backpropagation algorithm specialized for training our model based on artificial neural networks.

We investigated the efficacy of our method through experiments using synthetic and real-world datasets. The experimental results demonstrated that our method, which can adjust regularization parameters to fit data characteristics, compared favorably in prediction performance with other methods that develop coherent forecasts for hierarchical time series. Our regularization term accelerated the backpropagation algorithm. Regularization for mid-level time series was closely related to prediction performance.

One contribution made by this study is the establishment of a new computational framework of artificial neural networks for time series predictions. Moreover, our experiments using synthetic and real-world datasets demonstrated the potential of specialized prediction methods for hierarchical time series. We hope that this study will stimulate further research on exploiting structural properties for better time series predictions.

In future studies, we will extend our structured regularization model to other time series prediction methods, such as the autoregressive integrated moving average model [6, 7] and support vector regression [8]. Another direction of future research will be to develop a high-performance estimation algorithm for our method based on various mathematical optimization techniques [45–50].

References

  1. Athanasopoulos G., Ahmed R. A., & Hyndman R. J. (2009). Hierarchical forecasts for Australian domestic tourism. International Journal of Forecasting, 25(1), 146–166.
  2. Kremer M., Siemsen E., & Thomas D. J. (2016). The sum and its parts: Judgmental hierarchical forecasting. Management Science, 62(9), 2745–2764.
  3. Fliedner G. (1999). An investigation of aggregate variable time series forecast strategies with specific subaggregate time series statistical correlation. Computers & Operations Research, 26(10–11), 1133–1149.
  4. Ben Taieb S., Taylor J. W., & Hyndman R. J. (2017, August). Coherent probabilistic forecasts for hierarchical time series. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (pp. 3348–3357). JMLR.org.
  5. Wickramasuriya S. L., Athanasopoulos G., & Hyndman R. J. (2019). Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association, 114(526), 804–819.
  6. Brockwell P. J., Davis R. A., & Fienberg S. E. (1991). Time series: Theory and methods. Springer Science & Business Media.
  7. Hyndman R. J., & Athanasopoulos G. (2018). Forecasting: Principles and practice. OTexts.
  8. Karmy J. P., & Maldonado S. (2019). Hierarchical time series forecasting via support vector regression in the European travel retail industry. Expert Systems with Applications, 137, 59–73.
  9. Lütkepohl H. (2011). Forecasting aggregated time series variables. OECD Journal: Journal of Business Cycle Measurement and Analysis, 2010(2), 1–26.
  10. Widiarta H., Viswanathan S., & Piplani R. (2009). Forecasting aggregate demand: An analytical evaluation of top-down versus bottom-up forecasting in a production planning framework. International Journal of Production Economics, 118(1), 87–94.
  11. Park M., & Nassar M. (2014). Variational Bayesian inference for forecasting hierarchical time series. In International Conference on Machine Learning (ICML), Workshop on Divergence Methods for Probabilistic Inference, Beijing.
  12. Hyndman R. J., Ahmed R. A., Athanasopoulos G., & Shang H. L. (2011). Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis, 55(9), 2579–2589.
  13. Capistrán C., Constandse C., & Ramos-Francia M. (2010). Multi-horizon inflation forecasts using disaggregated data. Economic Modelling, 27(3), 666–677.
  14. Hyndman R. J., Lee A. J., & Wang E. (2016). Fast computation of reconciled forecasts for hierarchical and grouped time series. Computational Statistics & Data Analysis, 97, 16–32.
  15. van Erven T., & Cugliari J. (2015). Game-theoretically optimal reconciliation of contemporaneous hierarchical time series forecasts. In Modeling and Stochastic Learning for Forecasting in High Dimensions (pp. 297–317). Springer, Cham.
  16. Ben Taieb S., & Koo B. (2019, July). Regularized regression for hierarchical forecasting without unbiasedness conditions. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1337–1347).
  17. Ben Taieb S., Yu J., Barreto M. N., & Rajagopal R. (2017, February). Regularization in hierarchical time series forecasting with application to electricity smart meter data. In Thirty-First AAAI Conference on Artificial Intelligence.
  18. Hastie T., Tibshirani R., & Wainwright M. (2015). Statistical learning with sparsity: The lasso and generalizations. CRC Press.
  19. Jenatton R., Audibert J. Y., & Bach F. (2011). Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12(Oct), 2777–2824.
  20. Zhao P., Rocha G., & Yu B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A), 3468–3497.
  21. Nicholson W. B., Matteson D. S., & Bien J. (2017). VARX-L: Structured regularization for large vector autoregressions with exogenous variables. International Journal of Forecasting, 33(3), 627–651.
  22. Schimbinschi F., Moreira-Matias L., Nguyen V. X., & Bailey J. (2017). Topology-regularized universal vector autoregression for traffic forecasting in large urban areas. Expert Systems with Applications, 82, 301–316.
  23. Bien J., Taylor J., & Tibshirani R. (2013). A lasso for hierarchical interactions. Annals of Statistics, 41(3), 1111. pmid:26257447
  24. Kim S., & Xing E. P. (2012). Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. The Annals of Applied Statistics, 6(3), 1095–1117.
  25. Lim M., & Hastie T. (2015). Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics, 24(3), 627–654. pmid:26759522
  26. Sato T., Takano Y., & Nakahara T. (2019). Investigating consumers’ store-choice behavior via hierarchical variable selection. Advances in Data Analysis and Classification, 13(3), 621–639.
  27. Wen W., Wu C., Wang Y., Chen Y., & Li H. (2016). Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems (pp. 2074–2082).
  28. Caruana R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.
  29. Zhang Y., & Yang Q. (2017). A survey on multi-task learning. arXiv preprint arXiv:1707.08114.
  30. Evgeniou T., & Pontil M. (2004, August). Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 109–117).
  31. Jacob L., Vert J. P., & Bach F. R. (2009). Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems (pp. 745–752).
  32. Ruder S. (2017). An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
  33. Gao J., Murphey Y. L., & Zhu H. (2018). Multivariate time series prediction of lane changing behavior using deep neural network. Applied Intelligence, 48(10), 3523–3537.
  34. Hsieh W. W. (2004). Nonlinear multivariate and time series analysis by neural network methods. Reviews of Geophysics, 42(1).
  35. Khashei M., & Bijari M. (2010). An artificial neural network (p, d, q) model for timeseries forecasting. Expert Systems with Applications, 37(1), 479–489.
  36. Lai G., Chang W. C., Yang Y., & Liu H. (2018, June). Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (pp. 95–104).
  37. Zhang G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50, 159–175.
  38. Zhang G. P., & Qi M. (2005). Neural network forecasting for seasonal and trend time series. European Journal of Operational Research, 160(2), 501–514.
  39. Panagiotelis A., Athanasopoulos G., Gamakumara P., & Hyndman R. J. (2020). Forecast reconciliation: A geometric view with new insights on bias correction. International Journal of Forecasting.
  40. Bishop C. M. (2006). Pattern recognition and machine learning. Springer.
  41. Goodfellow I., Bengio Y., & Courville A. (2016). Deep learning. MIT Press.
  42. Bergmeir C., Hyndman R. J., & Koo B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120, 70–83.
  43. Karsoliya S. (2012). Approximating number of hidden layer neurons in multiple hidden layer BPNN architecture. International Journal of Engineering Trends and Technology, 3(6), 714–717.
  44. Lippmann R. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, 4(2), 4–22.
  45. Bertsimas D., King A., & Mazumder R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 813–852.
  46. Bertsimas D., Pauphilet J., & Van Parys B. (2019). Sparse regression: Scalable algorithms and empirical performance. arXiv preprint arXiv:1902.06547.
  47. Kudo K., Takano Y., & Nomura R. (2020). Stochastic discrete first-order algorithm for feature subset selection. IEICE Transactions on Information and Systems, 103(7), 1693–1702.
  48. Takano Y., & Miyashiro R. (2020). Best subset selection via cross-validation criterion. TOP, 28(2), 475–488.
  49. Tamura R., Kobayashi K., Takano Y., Miyashiro R., Nakata K., & Matsui T. (2017). Best subset selection for eliminating multicollinearity. Journal of the Operations Research Society of Japan, 60(3), 321–336.
  50. Tamura R., Kobayashi K., Takano Y., Miyashiro R., Nakata K., & Matsui T. (2019). Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor. Journal of Global Optimization, 73(2), 431–446.