A Kriging based spatiotemporal approach for traffic volume data imputation

Hongtai Yang; Jianjiang Yang; Lee D. Han; Xiaohan Liu; Li Pu; Shih-miao Chin; Ho-ling Hwang

doi:10.1371/journal.pone.0195957

Abstract

Along with the rapid development of Intelligent Transportation Systems, traffic data collection technologies have progressed fast. The emergence of innovative data collection technologies such as remote traffic microwave sensor, Bluetooth sensor, GPS-based floating car method, and automated license plate recognition, has significantly increased the variety and volume of traffic data. Despite the development of these technologies, the missing data issue is still a problem that poses great challenge for data based applications such as traffic forecasting, real-time incident detection, dynamic route guidance, and massive evacuation optimization. A thorough literature review suggests most current imputation models either focus on the temporal nature of the traffic data and fail to consider the spatial information of neighboring locations or assume the data follow a certain distribution. These two issues reduce the imputation accuracy and limit the use of the corresponding imputation methods respectively. As a result, this paper presents a Kriging based data imputation approach that is able to fully utilize the spatiotemporal correlation in the traffic data and that does not assume the data follow any distribution. A set of scenarios with different missing rates are used to evaluate the performance of the proposed method. The performance of the proposed method was compared with that of two other widely used methods, historical average and K-nearest neighborhood. Comparison results indicate that the proposed method has the highest imputation accuracy and is more flexible compared to other methods.

Citation: Yang H, Yang J, Han LD, Liu X, Pu L, Chin S-m, et al. (2018) A Kriging based spatiotemporal approach for traffic volume data imputation. PLoS ONE 13(4): e0195957. https://doi.org/10.1371/journal.pone.0195957

Editor: Xiaolei Ma, Beihang University, CHINA

Received: August 17, 2017; Accepted: April 3, 2018; Published: April 17, 2018

Copyright: © 2018 Yang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Along with the rapid development of Intelligent Transportation Systems (ITS), traffic data collection technologies have evolved dramatically [1, 2]. The emergence of innovative data collection technologies such as remote traffic microwave sensors (RTMS), Bluetooth sensors, and the GPS-based floating car method has made traffic data collection much easier than before [3–5].

Despite these technological advances, the missing data problem persists. Data may be missing for various reasons, such as sensor malfunction and loss of communication. The Mobility Monitoring Program of the Texas Transportation Institute (TTI) reported that, after screening erroneous data, the completeness rate of collected data can be anywhere between 16% and 93%, with a median value of 67% [6]. Williams and Hoel [7] reported that the missing rate of data collected by Georgia’s statewide advanced traffic management system was 10% or higher. The data missing rate of the freeway Performance Measurement System (PeMS) in Los Angeles was found to be 15% [8]. Chandra and Al-Deek [9] reported a 15% missing rate for data collected by loop detectors on I-4 in Orlando, Florida. An empirical study showed that the average missing rate of data collected by the Georgia NaviGAtor system on GA 400 was between 4% and 14% [10]. In Beijing, China, the average missing rate of the daily traffic volume data was about 10% (4% due to malfunction of detectors and 6% due to other reasons), with the missing rate of data collected by some detectors as high as 25% [11].

The missing data issue poses great challenges for data based applications such as traffic forecasting, incident detection, route guidance, and massive evacuation optimization. Considerable effort is therefore needed to impute the missing data.

Most current imputation techniques estimate a single value for each missing data point. These techniques include heuristic imputation, prediction imputation, and statistical learning imputation. Heuristic imputation methods fill a missing data point by averaging data from the same time period on neighboring days or from neighboring time periods of the same day. These methods assume that traffic characteristics are similar at the same time period of different days, or that traffic data fluctuate little over a short time period [12]. Another group of heuristic methods, called pattern-similar imputation methods, search the historical data for the most similar traffic data series and use it to estimate missing data points [13]. These heuristic methods make good use of the similarity and periodicity of traffic data. However, local variation and unexpected changes of traffic pattern can result in high imputation inaccuracy [14, 15]. To address this issue, two advanced methods, Bayesian Principal Component Analysis (BPCA) and Probabilistic Principal Component Analysis (PPCA), were proposed by Qu et al. [11, 16]. The researchers first show that traffic flow data follow a Gaussian distribution and that principal component analysis (PCA) can be used to retrieve the features of traffic flow. Then, a robust PCA is used to filter out the abnormal traffic flow data that disturb the imputation process. BPCA is slower than PPCA but yields similar results, so BPCA is usually carried out first on a small sample to determine the important parameters, and the imputation tasks are then performed by PPCA with those parameters.

Prediction is another important way to impute data, with regression as a classic example. Al-Deek et al. [9] compared the feasibility and imputation accuracy of three regression models: multiple regression, time series, and pairwise regression. They found that the quadratic model performed better because of its ability to capture nonlinear relationships among variables. To impute missing traffic data during holidays, Liu et al. introduced a new procedure that uses a non-parametric regression, the K-nearest neighborhood (KNN) method, to estimate missing values for different types of highways on holidays [17]. Other models that have been used for imputation include ARIMA [18], support vector regression [19], exponential smoothing [18], neural networks [20], hidden Markov models [21] and so on [12]. However, these prediction methods use only the data before the missing data point and ignore the data after it, which means they cannot take full advantage of the data set for imputation.

Statistical learning imputation methods assume that data are missing at random. Specifically, the missing data are considered as realizations of random variables characterized by a certain probability distribution function. D’Ambrosio et al. proposed an incremental approach theoretically motivated by the Statistical Learning Theory of Vapnik, and provided a new paradigm for missing data imputation [22]. Ma et al. employed copula theory to build a connection between the correlation function and the marginal distribution function of traffic flow, and demonstrated the effectiveness of the method for imputing missing data in large-scale transportation networks [23].

Most of the methods mentioned above use only temporal information for imputation, while spatial information is not well used. As traffic flows from upstream to downstream, the traffic stream characteristics at a certain location are usually closely related to those at neighboring locations. Incorporating surrounding traffic information has been shown to improve traffic prediction accuracy [24–26]. The literature review shows that Markov chain Monte Carlo (MCMC) [10] and PPCA [11] are two representative methods that use both temporal and spatial information. However, both MCMC and PPCA assume a probability distribution model of the data [12]. This assumption limits the use of these methods, since they may generate inaccurate imputation results when the data do not follow the assumed distribution. This paper therefore proposes an alternative, a Kriging based method that does not assume the data follow any probability distribution and that can fully use both temporal and spatial information to impute data.

The rest of the paper is organized as follows. Section 2 describes the proposed Kriging based imputation approach and the benchmark models used for comparison. The study location and data are described in Section 3. A brief description of data missing patterns and missing ratios is presented in Section 4. Section 5 compares the imputation accuracy of the proposed approach with that of the benchmark models, historical average and KNN. Concluding remarks are given in Section 6.

Methodology

Kriging based spatiotemporal imputation approach

Background about Kriging.

Kriging originated in the mining industry in the early 1950s as a means of improving ore reserve estimation and has been used as a synonym for geostatistical interpolation for many decades. Traditionally, the Kriging method only deals with spatial variables. Consider a set of spatial data z(μ_i) of an attribute z at locations μ_i, i = 1, 2, 3, …, n, where μ_i = (x_i, y_i) is a vector of spatial coordinates. The task of data imputation is to estimate missing values of z at a set of m locations. Generally speaking, Kriging is an optimal interpolation method based on regression using observed surrounding data points, weighted according to covariance values. Compared with other methods, the Kriging method has the following advantages: 1) it can reduce the effect of data clustering by assigning data points within a cluster less weight; 2) it can produce a measure of the possible estimation error (the Kriging variance) along with the estimates of the missing values [27].

Kriging based spatiotemporal imputation.

Traffic stream characteristics change over time and space. Traffic volume at a location is correlated not only with the traffic volumes at upstream and downstream locations but also with the volumes at the previous and next time steps [28]. Thus, the time dimension needs to be considered in the Kriging model to better estimate the missing data.

In the example of traffic volume, each data point is referenced by its temporal timestamp t_i, and spatial location μ_α = (x_α, y_α). Different from the aforementioned traditional spatial Kriging models, the coordinates are simplified as μ_α = x_α as roads can be seen as a longitudinal system with only one spatial dimension, in which x_α is the mile marker.

In the space-time framework, traffic volume is formulated as Q (μ_α, t_i); α = 1, 2, …, n; i = 1, 2, …, m. Similar to the spatial models, the covariance is defined as the variance of the mean squared difference between data separated by a given spatial and temporal lag (h_s, t_s): (1)

To be consistent with common practice in spatial statistics, the experimental semivariogram is computed as half of the covariance: (2)

In the ordinary space-time Kriging system, the missing value Q^*(μ, t) can be estimated as weighted average of values of surrounding locations: (3)

The weights λ_{α, i} (μ_α, t_i) assigned to each neighboring data point are calculated by minimizing the prediction variance: (4) while maintaining unbiasedness of the estimated value Q^*(μ, t).
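For orientation, a standard textbook form of ordinary space-time Kriging that matches the verbal definitions above can be sketched as follows. The notation is an assumption rather than the authors' exact formulation: \hat{\gamma} denotes the experimental semivariogram, N(h_s, h_t) the number of data pairs separated by the spatial lag h_s and temporal lag h_t (the temporal lag is written t_s in the text), and \sigma_E^2 the prediction variance.

```latex
\begin{align*}
\hat{\gamma}(h_s, h_t) &= \frac{1}{2\,N(h_s, h_t)}
  \sum_{\substack{\mu_\beta - \mu_\alpha = h_s \\ t_j - t_i = h_t}}
  \bigl[\, Q(\mu_\alpha, t_i) - Q(\mu_\beta, t_j) \,\bigr]^2 \\
Q^*(\mu, t) &= \sum_{\alpha=1}^{n} \sum_{i=1}^{m} \lambda_{\alpha,i}\, Q(\mu_\alpha, t_i),
  \qquad \sum_{\alpha=1}^{n} \sum_{i=1}^{m} \lambda_{\alpha,i} = 1 \\
\sigma_E^2 &= \operatorname{Var}\bigl[\, Q^*(\mu, t) - Q(\mu, t) \,\bigr]
\end{align*}
```

The constraint that the weights sum to one is what keeps the estimate unbiased when the mean is unknown, and minimizing \sigma_E^2 under that constraint yields the weights.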

As the covariance is calculated from both the spatial and temporal distance between data points, the spatial and temporal correlations of traffic volumes are both utilized in the model. In this study, a Gaussian variogram model is used to approximate the empirical variogram in the proposed spatiotemporal imputation method. It should be noted that the temporal and spatial properties of data are not similar, which makes it difficult for a single variogram to capture both the temporal and spatial variability. A straightforward solution is to regard the time dimension as an additional orthogonal dimension and to extend the traditional 2-dimensional Kriging to a 3-dimensional Kriging. In addition, the temporal dimension has to be rescaled to align with the spatial directions. All of the above is implemented using RStudio and related packages.
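The idea of treating rescaled time as an extra coordinate can be illustrated with a minimal sketch in Python with NumPy. This is not the authors' R implementation: the function names, the variogram parameters (`nugget`, `psill`, `vrange`) and the `TIME_SCALE` factor are hypothetical, and the variogram is assumed to be already fitted rather than estimated from data.

```python
import numpy as np


def gaussian_variogram(h, nugget, psill, vrange):
    """Gaussian semivariogram model evaluated at lag distance h."""
    return nugget + psill * (1.0 - np.exp(-(h / vrange) ** 2))


def ordinary_kriging(coords, values, target, nugget=0.1, psill=1.0, vrange=1.0):
    """Ordinary Kriging estimate at `target`.

    coords : (n, 2) array-like of [mile marker, rescaled time]
    values : (n,) array-like of observed volumes
    target : (2,) array-like, coordinates of the missing point
    Returns (estimate, kriging_variance, weights).
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    n = values.size

    # Left-hand side: semivariogram between all observed pairs, bordered by
    # a row and column of ones for the unbiasedness (sum-to-one) constraint.
    pair_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gaussian_variogram(pair_dist, nugget, psill, vrange)
    np.fill_diagonal(A, 0.0)  # gamma(0) = 0; the corner entry is also 0

    # Right-hand side: semivariogram between each observation and the target.
    b = np.ones(n + 1)
    target_dist = np.linalg.norm(coords - target, axis=1)
    b[:n] = gaussian_variogram(target_dist, nugget, psill, vrange)

    solution = np.linalg.solve(A, b)
    weights, lagrange = solution[:n], solution[n]
    estimate = float(weights @ values)
    variance = float(weights @ b[:n] + lagrange)
    return estimate, variance, weights


# Toy example with made-up volumes (not data from the study).
TIME_SCALE = 0.01  # hypothetical: spatial units assigned to one second of lag
observations = [(1.0, 0), (1.0, 60), (2.0, 30), (3.0, 0), (3.0, 60)]  # (mile, s)
volumes = [12.0, 14.0, 13.0, 11.0, 15.0]
coords = [(x, t * TIME_SCALE) for x, t in observations]
est, var, w = ordinary_kriging(coords, volumes, target=(2.0, 0.0))
print(est, var)
```

The choice of `TIME_SCALE` controls how much weight temporal neighbors receive relative to spatial neighbors, which is why the rescaling step matters in practice.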

Benchmark imputation methods for comparison

To evaluate the performance of the Kriging based spatiotemporal approach, the results were compared with those of two classical imputation models, historical average and KNN.

Historical average (same time and weekdays and same stations).

The historical average model is a widely used prediction model [29]. A missing data point is estimated by averaging data points of the same location at the same time of the day on the same day of the week. To be more robust to extreme values, the historical median can be used too.
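A minimal sketch of this benchmark in Python with pandas is given below. The column names (`station`, `timestamp`, `volume`) and the function name are assumptions for illustration, and missing values are assumed to be stored as NaN.

```python
import numpy as np
import pandas as pd


def historical_average_impute(df, use_median=False):
    """Fill missing volumes with the mean (or median) of observations from the
    same station, the same day of the week and the same time of day.

    df must have columns 'station', 'timestamp' (datetime64) and 'volume'.
    """
    ts = df["timestamp"]
    keyed = df.assign(
        _dow=ts.dt.dayofweek,
        _tod=ts.dt.hour * 3600 + ts.dt.minute * 60 + ts.dt.second,
    )
    stat = "median" if use_median else "mean"
    # transform() returns one value per row; NaN entries are skipped
    # when the group statistic is computed.
    fill = keyed.groupby(["station", "_dow", "_tod"])["volume"].transform(stat)
    out = df.copy()
    out["volume"] = df["volume"].fillna(fill)
    return out


# Toy example: three consecutive Fridays at 08:00, the middle one missing.
stamps = pd.to_datetime(
    ["2013-05-03 08:00:00", "2013-05-10 08:00:00", "2013-05-17 08:00:00"]
)
toy = pd.DataFrame(
    {"station": [119, 119, 119], "timestamp": stamps, "volume": [10.0, np.nan, 14.0]}
)
print(historical_average_impute(toy))
```

A point stays missing if no other observation exists for the same station, weekday and time of day, so a fallback may be needed for sparse data.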

K-nearest neighborhood.

Because the data were recorded every 30 seconds by sensors, 2880 data points were collected every day (2880 = 24 h × 60 min/h × 60 s/min / 30 s). To implement the KNN method [30], the traffic volume data need to be reformatted as a 2880 × (s × d) matrix, where s is the number of stations and d is the total number of days. After the transformation, the data collected at a given station on a specific day form one column of the matrix.

For a column with missing values, the Euclidean distances between this column and the other columns were calculated to find the k nearest neighbors. Finally, the weights for the k nearest neighbors were derived, and the missing values were estimated as the weighted averages of the k nearest neighbors [31].
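The column-wise KNN procedure can be sketched as follows in Python with NumPy. Two details are assumptions, since the text does not specify them: distances are computed only over rows observed in both columns (as a root mean squared difference, so columns with different numbers of shared rows remain comparable), and neighbors are weighted by inverse distance.

```python
import numpy as np


def knn_impute_column(X, col, k=3, eps=1e-9):
    """Impute NaN entries of column `col` of X from its k nearest columns.

    X is a (time steps) x (station-days) matrix with NaN for missing values.
    Returns a copy of the target column with missing entries filled where
    at least one neighbor has an observation in the same row.
    """
    X = np.asarray(X, dtype=float)
    target = X[:, col]
    missing = np.isnan(target)
    out = target.copy()

    # Distance from the target column to every other column,
    # using only the rows observed in both columns.
    dists = {}
    for j in range(X.shape[1]):
        if j == col:
            continue
        shared = ~missing & ~np.isnan(X[:, j])
        if shared.any():
            diff = target[shared] - X[shared, j]
            dists[j] = float(np.sqrt(np.mean(diff ** 2)))

    for r in np.where(missing)[0]:
        # Keep only neighbors that have a value in this row.
        candidates = sorted((d, j) for j, d in dists.items() if not np.isnan(X[r, j]))
        nearest = candidates[:k]
        if not nearest:
            continue
        d = np.array([c[0] for c in nearest])
        v = np.array([X[r, c[1]] for c in nearest])
        w = 1.0 / (d + eps)  # inverse-distance weights (an assumed choice)
        out[r] = np.sum(w * v) / np.sum(w)
    return out


# Toy matrix: 4 time steps x 3 station-days, one missing value in column 0.
X_toy = np.array(
    [[1.0, 1.0, 10.0], [2.0, 2.0, 20.0], [np.nan, 3.0, 30.0], [4.0, 4.0, 40.0]]
)
print(knn_impute_column(X_toy, col=0, k=1))
```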

Evaluation criteria

Mean absolute deviation (MAD) and root mean squared error (RMSE) were used to compare the imputation results of the proposed approach with those of the benchmark methods. Suppose there were n missing data points in the test dataset, with y_i as the ground truth for the i^th missing data point and \hat{y}_i as the estimated value for that point. The two measures can be calculated as follows:

MAD = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| (5)

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} (6)
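Both measures follow their standard definitions: MAD is the mean of the absolute differences between the true and estimated values, and RMSE is the square root of the mean of the squared differences. A short Python sketch, with hypothetical function and argument names, is:

```python
import numpy as np


def mad(y_true, y_pred):
    """Mean absolute deviation between ground truth and imputed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error between ground truth and imputed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values: absolute errors are 1, 0 and 2.
truth = [2.0, 4.0, 6.0]
imputed = [1.0, 4.0, 8.0]
print(mad(truth, imputed), rmse(truth, imputed))
```

Because RMSE squares the errors before averaging, it penalizes a few large errors more heavily than MAD does, so reporting both gives a fuller picture of imputation quality.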

Data source and study locations

Smart Way [32], a key program of Tennessee’s intelligent transportation system, uses solar-powered nonintrusive RTMS to collect real-time highway traffic information (including volume, speed and occupancy). The collected data are sent to the traffic management center. The data used in this study are collected by these RTMS radars installed along interstates across Tennessee. Vehicle presence, traffic volume, speed, and occupancy per lane are recorded every 30 seconds by these sensors [33].

To better identify data missing patterns, 33 days of data (from April 29 to May 31, 2013) were collected for six RTMS stations along Ellington Parkway in Nashville, Tennessee [34]. A detailed description of the RTMS stations is presented in Table 1 and their locations are given in Fig 1. As the data are collected every 30 seconds, a total of 570,240 (2880×33×6) data points would be obtained if no data were missing.

Fig 1. RTMS stations for this study.

https://doi.org/10.1371/journal.pone.0195957.g001

Table 1. Description of RTMS stations.

https://doi.org/10.1371/journal.pone.0195957.t001

Unlike previous studies, imputation was performed on the raw data rather than aggregated data, to prevent information loss during the aggregation process. The data description and missing rates are shown in Table 2. Numbers in parentheses indicate the corresponding standard errors. The average count is the average number of vehicles recorded by the sensors over 30 seconds.

Table 2. Data description.

https://doi.org/10.1371/journal.pone.0195957.t002

Data missing rates

To understand RTMS radars’ performance, a boxplot of missing rates by station and day of the week is shown in Fig 2. It shows that the performance of a station varies across days and the performance of different stations on the same day also varies significantly. Station 115 usually has the lowest data missing rates with only a few exceptions. In contrast, station 117 has the highest data missing rates across the week.

Fig 2. Boxplot of data missing rates by station and day of the week.

https://doi.org/10.1371/journal.pone.0195957.g002

Evaluation of imputation performance

Experiment design of data missing scenarios

A complete data set is preferred to train the proposed and benchmark models and to evaluate their performance. A close look at the data reveals that the data collected by station 119 on May 17, 2013 have a low missing rate of 0.03% (only one data point is missing), and thus are used in this study. Another reason for choosing station 119 is that it has both upstream and downstream stations, which means that both upstream and downstream information is available.

To compare the imputation performance, the imputation methods are tested on simulated scenarios with different data missing rates. The missing rates are set to different percentiles of the actual missing rates for all stations during the 33 days of the study. The missing data points are generated randomly. The simulation process is as follows:

1. Choose a specific data missing rate among the 25%, 30%, 35%, …, 75% percentiles of the missing rates of all stations during the study period;
2. Based on the missing rate selected above, determine the number of points to be flagged as missing in the dataset;
3. Generate missing data points randomly;
4. Repeat steps 1 to 3 for different missing rates to generate the corresponding scenarios;
5. Perform imputation on the generated scenarios using the proposed method and the benchmark methods, and compare their results.
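The scenario generation in the steps above can be sketched in Python with NumPy. The function name, the `observed_rates` input (missing rates expressed as fractions between 0 and 1) and the fixed random seed are assumptions for illustration; the synthetic inputs in the example are not the study data.

```python
import numpy as np


def make_missing_scenarios(series, observed_rates, percentiles=range(25, 80, 5), seed=0):
    """Create one masked copy of `series` per percentile of the observed rates.

    Returns a dict mapping percentile -> (rate, missing_index, masked_series),
    where masked entries are set to NaN.
    """
    rng = np.random.default_rng(seed)
    series = np.asarray(series, dtype=float)
    scenarios = {}
    for p in percentiles:
        # Steps 1 and 2: pick the rate and convert it to a number of points.
        rate = float(np.percentile(observed_rates, p))
        n_missing = int(round(rate * series.size))
        # Step 3: flag points as missing uniformly at random, without repeats.
        idx = rng.choice(series.size, size=n_missing, replace=False)
        masked = series.copy()
        masked[idx] = np.nan
        scenarios[p] = (rate, idx, masked)
    return scenarios


# Toy example: one day of 30-second counts and made-up observed missing rates.
day = np.arange(2880, dtype=float)
rates = np.linspace(0.01, 0.361, 50)
scenarios = make_missing_scenarios(day, rates)
print(len(scenarios))
```

Keeping the index of the masked points makes it possible to compare imputed values against the withheld ground truth in the final step.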

On May 17, 2013, traffic was congested during the rush hours and in free-flow condition during the non-rush hours, just like on the other days. Since the missing data points were generated randomly, with missing rates ranging from 1.0% to 36.1%, the missing data were likely to cover both free-flow and congested conditions.

Imputation performance

The proposed imputation method and the benchmark methods were tested on 11 different scenarios. The semivariogram is shown in Fig 3. Imputation results are shown in Table 3 and Fig 4. It can be seen from the table that the proposed imputation method is more accurate than the other two methods in most scenarios. Only when the missing rate is lower than 1% does the historical average method perform better than the proposed imputation method. The KNN method usually has the lowest imputation accuracy. This may be because there are only three features in this study for KNN to determine the nearest neighbors, while KNN usually needs more than three features to obtain a reliable result [1].

Fig 3. Initial variogram.

https://doi.org/10.1371/journal.pone.0195957.g003

Fig 4. Imputation performance of proposed approach.

https://doi.org/10.1371/journal.pone.0195957.g004

Table 3. Performance of proposed approach.

https://doi.org/10.1371/journal.pone.0195957.t003

Conclusions

The paper presents a Kriging based spatiotemporal data imputation approach that is able to fully utilize the spatiotemporal information of the traffic data and that does not assume the data follow any distribution. As traffic flows from upstream to downstream, the traffic stream characteristics at a certain location are usually related to those at neighboring locations, so the characteristics at upstream and downstream locations can be used to impute the missing value at a specific location. In addition, the traffic characteristics of a specific location at a certain time are also related to those of previous and future days or time periods. Therefore, a Kriging based imputation method that considers both temporal and spatial information is proposed. Compared with KNN and historical average, the proposed method has higher imputation accuracy in ten out of the eleven generated scenarios. Only when the data missing rate is lower than 1% does the historical average method perform better than the proposed imputation method, which suggests that the historical average method is more suitable for scenarios in which only a few data points are missing. This study also finds that the KNN method has the lowest imputation accuracy. The result of KNN may be more reliable when more features are available to determine the nearest neighbors.

Supporting information

S1 Data. Data used in this study.

https://doi.org/10.1371/journal.pone.0195957.s001

(RAR)

References

1. Ma X, Dai Z, He Z, Ma J, Wang Y, Wang Y. Learning Traffic as Images: A Deep Convolutional Neural Network for Large-Scale Transportation Network Speed Prediction. Sensors. 2017;17(4).
2. Ma X, Yu H, Wang Y, Wang Y. Large-Scale Transportation Network Congestion Evolution Prediction Using Deep Learning Theory. Plos One. 2015;10(3):e0119044. pmid:25780910
3. Ding C, Duan J, Zhang Y, Wu X, Yu G. Using an ARIMA-GARCH Modeling Approach to Improve Subway Short-Term Ridership Forecasting Accounting for Dynamic Volatility. IEEE Transactions on Intelligent Transportation Systems. 2017;PP(99):1–11.
4. Ding C, Wang D, Ma X, Li H. Predicting Short-Term Subway Ridership and Prioritizing Its Influential Factors Using Gradient Boosting Decision Trees. Sustainability. 2016;8(11):1100.
5. Ma X, Ding C, Luan S, Wang Y, Wang Y. Prioritizing Influential Factors for Freeway Incident Clearance Time Prediction Using the Gradient Boosting Decision Trees Method. IEEE Transactions on Intelligent Transportation Systems. 2017;18(9):2303–10.
6. Turner S. Defining and Measuring Traffic Data Quality: White Paper on Recommended Approaches. Transportation Research Record: Journal of the Transportation Research Board. 2004;1870(-1):62–9.
7. Williams BM, Hoel LA. Modeling and Forecasting Vehicular Traffic Flow as a Seasonal ARIMA Process: Theoretical Basis and Empirical Results. Journal of Transportation Engineering. 2003;129(6):664–72.
8. Chen C, Kwon J, Rice J, Skabardonis A, Varaiya P. Detecting errors and imputing missing data for single-loop surveillance systems. Transportation Data Research. 2003;1855(1855):160–7. WOS:000189494800020.
9. Al-Deek H, Venkata C, Ravi Chandra S. New Algorithms for Filtering and Imputation of Real-time and Archived Dual-loop Detector Data in I-4 Data Warehouse. Transportation Research Record: Journal of the Transportation Research Board. 2004;(1867):116–26.
10. Ni D, John Leonard II. Markov Chain Monte Carlo Multiple Imputation Using Bayesian Networks for Incomplete Intelligent Transportation Systems Data. Transportation Research Record Journal of the Transportation Research Board. 2005;1935(1):57–67.
11. Qu L, Li L, Zhang Y, Hu J. PPCA-Based Missing Data Imputation for Traffic Flow Volume: A Systematical Approach. IEEE Transactions on Intelligent Transportation Systems. 2009;10(3):512–22.
12. Li Y, Li Z, Li L. Missing traffic data: comparison of imputation methods. IET Intelligent Transport Systems. 2014;8(1):51–7.
13. Zhong M, Sharma S, Lingras P. Matching Patterns for Updating Missing Values of Traffic Counts. Transportation Planning and Technology. 2006;29(2):141–56. WOS:000239390800004.
14. Smith B, Scherer W, Conklin J. Exploring Imputation Techniques for Missing Data in Transportation Management Systems. Transportation Research Record. 2003;1836(1):132–42.
15. Zhong M, Sharma S, Lingras P. Genetically Designed Models for Accurate Imputation of Missing Traffic Counts. Transportation Research Record: Journal of the Transportation Research Board. 2004;(1879):71–9.
16. Qu L, Zhang Y, Hu J, Jia L, Li L, editors. A BPCA Based Missing Value Imputing Method for Traffic Flow Volume Data. Intelligent Vehicles Symposium, 2008 IEEE; 2008: IEEE.
17. Liu Z, Sharma S, Datla S. Imputation of Missing Traffic Data during Holiday Periods. Transportation Planning and Technology. 2008;31(5):525–44.
18. Williams B, Durvasula P, Brown D. Urban Freeway Traffic Flow Prediction: Application of Seasonal Autoregressive Integrated Moving Average and Exponential Smoothing Models. Transportation Research Record: Journal of the Transportation Research Board. 1998;(1644):132–41.
19. Wu C-H, Ho J-M, Lee D-T. Travel-time Prediction with Support Vector Regression. IEEE transactions on intelligent transportation systems. 2004;5(4):276–81.
20. Smith BL, Demetsky MJ, editors. Short-term Traffic Flow Prediction Models-A Comparison of Neural Network and Nonparametric Regression Approaches. Systems, Man, and Cybernetics, 1994 Humans, Information and Technology, 1994 IEEE International Conference on; 1994: IEEE.
21. Qi Y, Ishak S. A Hidden Markov Model for Short Term Prediction of Traffic Conditions on Freeways. Transportation Research Part C: Emerging Technologies. 2014;43:95–111. https://doi.org/10.1016/j.trc.2014.02.007.
22. D’Ambrosio A. Accurate Tree-based Missing Data Imputation and Data Fusion within the Statistical Learning Paradigm. Journal of Classification. 2012;29(2):227–58.
23. Ma X, Luan S, Du B, Yu B. Spatial Copula Model for Imputing Traffic Flow Data from Remote Microwave Sensors. Sensors. 2017;17(10):2160.
24. Kamarianakis Y, Prastacos P. Space–time modeling of traffic flow. Computers & Geosciences. 2005;31(2):119–33.
25. Yang J, Han LD, Freeze PB, editors. Short-Term Freeway Speed Profiling Based on Longitudinal Spatial-Temporal Dynamics. Transportation Research Board 93rd Annual Meeting; 2014.
26. Stathopoulos A, Karlaftis MG. A Multivariate State Space Approach for Urban Traffic Flow Modeling and Prediction. Transportation Research Part C: Emerging Technologies. 2003;11(2):121–35.
27. Cressie N. The origins of kriging. Mathematical Geology. 1990;22(3):239–52.
28. Yang H, Cherry CR, Zaretzki R, Ryerson MS, Liu X, Fu Z. A GIS‐Based Method to Identify Cost‐effective Routes for Rural Deviated Fixed Route Transit. Journal of Advanced Transportation. 2016.
29. Hull J, White A. Incorporating Volatility Updating into the Historical Simulation Method for Value-at-risk. Journal of risk. 1998;1(1):5–19.
30. Altman NS. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. American Statistician. 1992;46(3):175–85.
31. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, et al. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics. 2001;17(6):520–5. pmid:11395428
32. Potts JF, Marshall MA, Crockett EC, Washington J. A Guide for Planning and Operating Flexible Public Transportation Services. 2010.
33. Yang H, Cherry C. Statewide Rural-Urban Bus Travel Demand and Network Evaluation: An Application in Tennessee. Journal of Public Transportation. 2012;15(3):97–111.
34. Yang H, Cherry CR. Use characteristics and demographics of rural transit riders: a case study in Tennessee. Transportation Planning and Technology. 2017;40(2):213–27.

[ref1] 1. Ma X, Dai Z, He Z, Ma J, Wang Y, Wang Y. Learning Traffic as Images: A Deep Convolutional Neural Network for Large-Scale Transportation Network Speed Prediction. Sensors. 2017;17(4).
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Ma X, Yu H, Wang Y, Wang Y. Large-Scale Transportation Network Congestion Evolution Prediction Using Deep Learning Theory. Plos One. 2015;10(3):e0119044. pmid:25780910
View Article
PubMed/NCBI
Google Scholar

[5] View Article

[6] PubMed/NCBI

[7] Google Scholar

[ref3] 3. Ding C, Duan J, Zhang Y, Wu X, Yu G. Using an ARIMA-GARCH Modeling Approach to Improve Subway Short-Term Ridership Forecasting Accounting for Dynamic Volatility. IEEE Transactions on Intelligent Transportation Systems. 2017;PP(99):1–11.
View Article
Google Scholar

[9] View Article

[10] Google Scholar

[ref4] 4. Ding C, Wang D, Ma X, Li H. Predicting Short-Term Subway Ridership and Prioritizing Its Influential Factors Using Gradient Boosting Decision Trees. Sustainability. 2016;8(11):1100.
View Article
Google Scholar

[12] View Article

[13] Google Scholar

[ref5] 5. Ma X, Ding C, Luan S, Wang Y, Wang Y. Prioritizing Influential Factors for Freeway Incident Clearance Time Prediction Using the Gradient Boosting Decision Trees Method. IEEE Transactions on Intelligent Transportation Systems. 2017;18(9):2303–10.
View Article
Google Scholar

[15] View Article

[16] Google Scholar

[ref6] 6. Turner S. Defining and Measuring Traffic Data Quality: White Paper on Recommended Approaches. Transportation Research Record: Journal of the Transportation Research Board. 2004;1870(-1):62–9.
View Article
Google Scholar

[18] View Article

[19] Google Scholar

[ref7] 7. Williams BM, Hoel LA. Modeling and Forecasting Vehicular Traffic Flow as a Seasonal ARIMA Process: Theoretical Basis and Empirical Results. Journal of Transportation Engineering. 2003;129(6):664–72.
View Article
Google Scholar

[21] View Article

[22] Google Scholar

[ref8] 8. Chen C, Kwon J, Rice J, Skabardonis A, Varaiya P. Detecting errors and imputing missing data for single-loop surveillance systems. Transportation Data Research. 2003;1855(1855):160–7. WOS:000189494800020.
View Article
Google Scholar

[24] View Article

[25] Google Scholar

[ref9] 9. Al-Deek H, Venkata C, Ravi Chandra S. New Algorithms for Filtering and Imputation of Real-time and Archived Dual-loop Detector Data in I-4 Data Warehouse. Transportation Research Record: Journal of the Transportation Research Board. 2004;(1867):116–26.
View Article
Google Scholar

[27] View Article

[28] Google Scholar

[ref10] 10. Ni D, John Leonard II. Markov Chain Monte Carlo Multiple Imputation Using Bayesian Networks for Incomplete Intelligent Transportation Systems Data. Transportation Research Record Journal of the Transportation Research Board. 2005;1935(1):57–67.
View Article
Google Scholar

[30] View Article

[31] Google Scholar

[ref11] 11. Qu L, Li L, Zhang Y, Hu J. PPCA-Based Missing Data Imputation for Traffic Flow Volume: A Systematical Approach. IEEE Transactions on Intelligent Transportation Systems. 2009;10(3):512–22.
View Article
Google Scholar

[33] View Article

[34] Google Scholar

[ref12] 12. Li Y, Li Z, Li L. Missing traffic data: comparison of imputation methods. IET Intelligent Transport Systems. 2014;8(1):51–7.
View Article
Google Scholar

[36] View Article

[37] Google Scholar

[ref13] 13. Zhong M, Sharma S, Lingras P. Matching Patterns for Updating Missing Values of Traffic Counts. Transportation Planning and Technology. 2006;29(2):141–56. WOS:000239390800004.

[ref14] 14. Smith B, Scherer W, Conklin J. Exploring Imputation Techniques for Missing Data in Transportation Management Systems. Transportation Research Record. 2003;1836(1):132–42.

[ref15] 15. Zhong M, Sharma S, Lingras P. Genetically Designed Models for Accurate Imputation of Missing Traffic Counts. Transportation Research Record: Journal of the Transportation Research Board. 2004;(1879):71–9.

[ref16] 16. Qu L, Zhang Y, Hu J, Jia L, Li L, editors. A BPCA Based Missing Value Imputing Method for Traffic Flow Volume Data. Intelligent Vehicles Symposium, 2008 IEEE; 2008: IEEE.

[ref17] 17. Liu Z, Sharma S, Datla S. Imputation of Missing Traffic Data during Holiday Periods. Transportation Planning and Technology. 2008;31(5):525–44.

[ref18] 18. Williams B, Durvasula P, Brown D. Urban Freeway Traffic Flow Prediction: Application of Seasonal Autoregressive Integrated Moving Average and Exponential Smoothing Models. Transportation Research Record: Journal of the Transportation Research Board. 1998;(1644):132–41.

[ref19] 19. Wu C-H, Ho J-M, Lee D-T. Travel-time Prediction with Support Vector Regression. IEEE Transactions on Intelligent Transportation Systems. 2004;5(4):276–81.

[ref20] 20. Smith BL, Demetsky MJ, editors. Short-term Traffic Flow Prediction Models-A Comparison of Neural Network and Nonparametric Regression Approaches. Systems, Man, and Cybernetics, 1994 Humans, Information and Technology, 1994 IEEE International Conference on; 1994: IEEE.

[ref21] 21. Qi Y, Ishak S. A Hidden Markov Model for Short Term Prediction of Traffic Conditions on Freeways. Transportation Research Part C: Emerging Technologies. 2014;43:95–111. https://doi.org/10.1016/j.trc.2014.02.007.

[ref22] 22. D’Ambrosio A. Accurate Tree-based Missing Data Imputation and Data Fusion within the Statistical Learning Paradigm. Journal of Classification. 2012;29(2):227–58.

[ref23] 23. Ma X, Luan S, Du B, Yu B. Spatial Copula Model for Imputing Traffic Flow Data from Remote Microwave Sensors. Sensors. 2017;17(10):2160.

[ref24] 24. Kamarianakis Y, Prastacos P. Space–time modeling of traffic flow. Computers & Geosciences. 2005;31(2):119–33.

[ref25] 25. Yang J, Han LD, Freeze PB, editors. Short-Term Freeway Speed Profiling Based on Longitudinal Spatial-Temporal Dynamics. Transportation Research Board 93rd Annual Meeting; 2014.

[ref26] 26. Stathopoulos A, Karlaftis MG. A Multivariate State Space Approach for Urban Traffic Flow Modeling and Prediction. Transportation Research Part C: Emerging Technologies. 2003;11(2):121–35.

[ref27] 27. Cressie N. The origins of kriging. Mathematical Geology. 1990;22(3):239–52.

[ref28] 28. Yang H, Cherry CR, Zaretzki R, Ryerson MS, Liu X, Fu Z. A GIS‐Based Method to Identify Cost‐effective Routes for Rural Deviated Fixed Route Transit. Journal of Advanced Transportation. 2016.

[ref29] 29. Hull J, White A. Incorporating Volatility Updating into the Historical Simulation Method for Value-at-risk. Journal of Risk. 1998;1(1):5–19.

[ref30] 30. Altman NS. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. American Statistician. 1992;46(3):175–85.

[ref31] 31. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, et al. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics. 2001;17(6):520–5. pmid:11395428

[ref32] 32. Potts JF, Marshall MA, Crockett EC, Washington J. A Guide for Planning and Operating Flexible Public Transportation Services. 2010.

[ref33] 33. Yang H, Cherry C. Statewide Rural-Urban Bus Travel Demand and Network Evaluation: An Application in Tennessee. Journal of Public Transportation. 2012;15(3):97–111.

[ref34] 34. Yang H, Cherry CR. Use characteristics and demographics of rural transit riders: a case study in Tennessee. Transportation Planning and Technology. 2017;40(2):213–27.
