Toward safer highways, application of XGBoost and SHAP for real-time accident detection and feature analysis
Introduction
Traffic accidents are a major concern in countries worldwide. With the rapid increase in the number of highways and motorized vehicles in most countries (Global status report on road safety, 2015), the total number of accidents has increased substantially worldwide. An annual report from the National Highway Traffic Safety Administration (NHTSA) indicates that around 5,000,000 traffic accidents occur in the United States (US) each year (Traffic Safety Facts, 2013). In fact, traffic accidents have become the second leading cause of death for young people and the third leading cause of death for people aged 30 to 44 in the US (Traffic Safety Facts, 2013). Moreover, traffic accidents resulting in death or injury increased by 3 % and 2 %, respectively, between 2011 and 2012 (Traffic Safety Facts, 2013). Globally, the World Health Organization reported that 1.25 million people die in traffic accidents every year (Global status report on road safety, 2015). Beyond the human health impacts, traffic accidents also disrupt traffic, especially on highways (Azimi et al., 2019), at intersections (Arvin et al., 2019a), and in work zones (Mokhtarimousavi et al., 2019), often resulting in increased congestion and emissions and the worsening of other factors if accidents are not detected properly and rapidly (Traffic Incident Management, 2013).
The transport community has increasingly made use of novel computational techniques and new data sources (Sharifi et al., 2019; Golshani et al., 2018; Parsa et al., 2019a; Nasr Esfahani et al., 2019; Razi-Ardakani et al., 2018; Ahangari et al., 2019), which have also facilitated the prediction (Mansourkhaki et al., 2016, 2017), detection (Parsa et al., 2019b), and severity estimation (Azimi et al., 2020; Arvin et al., 2019b) of accidents. For example, smartphones are equipped with several sensors, such as an accelerometer, a magnetometer, and a gyroscope, that can provide data to detect accidents (Alwan et al., 2016; Fernandes et al., 2016). In fact, traffic information from smartphones has been integrated with other sources of data, such as NetLogo simulated data (Thomas and Vidal, 2017) and airbag trigger data (Zaldivar et al., 2011), to detect accidents. These types of data are rarely available, however, and they tend to be expensive to acquire (White et al., 2011). Other potential data sources include visual data such as photos and videos. Several studies have used these data sources along with different techniques, including matrix approximation (Xia et al., 2015), a statistical heuristic method (Maaloul et al., 2017), a hybrid support vector machine with an extended Kalman filter (Vishnu and Nedunchezhian, 2018), and an extreme learning machine (Chen et al., 2016b), to detect accidents. Vision-based accident detection relies on a large amount of photo and video data, however, and therefore requires large storage capacity (Chen et al., 2016b). In addition, the accuracy of vision-based accident detection models is significantly affected by weather conditions and picture/frame resolution (Xia et al., 2015). Social media is another new data source that can be used to detect accidents, for instance through crawling, processing, and filtering tweets (Gu et al., 2016).
As an example, Deep Belief Network (DBN) and Long Short-Term Memory (LSTM) are two deep learning models that have been used to detect accidents from Twitter content (Zhang et al., 2018). In these models, it is preferable to fuse social media with traffic data to achieve higher performance (Zhang and He, 2016). In fact, social media data are mostly considered a supplement rather than a main source of data for accident detection since they tend to be less available than traffic data (Schulz et al., 2013). Moreover, the precision of the location of accidents reported in social media is generally poor (Zhang and He, 2016).
Ultimately, traditional traffic data offer a rich and relatively available source of data that can be used for accident detection (Amin and Jalil, 2012; Xu et al., 2016a). In particular, loop detector data are available for most highways and expressways in the US. Many techniques have been used to detect accidents using traffic data, including conditional logistic regression (Kwak and Kho, 2016) and dynamic methods that detect the shock waves caused by an accident in order to identify secondary accidents (Wang et al., 2016). Moreover, many machine learning models have been used to detect accidents, including k-nearest neighbor (Ozbayoglu et al., 2017), regression tree (Ozbayoglu et al., 2017), feed-forward neural network (Ozbayoglu et al., 2017), support vector machine (Dong et al., 2015), probabilistic neural network (Parsa et al., 2019c), dynamic Bayesian network (Sun and Sun, 2015), and deep learning (Chen et al., 2016a; Parsa et al., 2019a). Overall, traffic data are often considered to be best suited to detect and predict the occurrence of accidents (Xu et al., 2016b).
eXtreme Gradient Boosting (XGBoost) is a relatively new method, initially proposed by Chen and Guestrin (2016), that generally achieves high accuracy and fast processing times while being computationally less costly and less complex (Chen and Guestrin, 2016; Hamilton et al., 2019). Meng (2018) has already leveraged XGBoost to predict the occurrence and the duration of accidents using several data sources including road geometric design, historical accident data, as well as traffic and weather data. Furthermore, two studies (Hamilton et al., 2019; Schlögl et al., 2019) showed that, when predicting the likelihood of an accident, XGBoost performed better than several other machine learning techniques, including Logistic Regression, Bayesian Regularized Neural Network, Pegasos SVM, Bagging Average Neural Networks, Deep Neural Network, and Gradient Boosting. Shan et al. (2018) also employed an artificial neural network to integrate multiple XGBoost models in order to predict accident duration. Finally, XGBoost has also been used to predict the severity of traffic accidents, and it achieved high performance, especially when using spatial data (Mokoatle et al., 2019).
In terms of model interpretation, which is especially important when using ML models that are often difficult to interpret, several studies have started to take advantage of SHAP (SHapley Additive exPlanation) (Ribeiro et al., 2016; Štrumbelj and Kononenko, 2014). SHAP builds on the Shapley value, which was proposed by Shapley in 1953 and is rooted in game theory (Shapley, 1953). It offers a powerful and insightful measure of the importance of a feature in a model. In 2017, Lundberg and Lee developed a practical package in Python that is able to calculate SHAP for different techniques including LightGBM, GBoost, CatBoost, XGBoost, and Scikit-learn tree models (Lundberg and Lee, 2017). Given the known advantages of SHAP, researchers have started to use this technique (Movahedi and Derrible, 2020). In traffic safety, Mihaita et al. (2019) employed SHAP to analyze the impact of different features on accident duration.
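To make the game-theoretic idea concrete, the following minimal sketch computes exact Shapley values for a toy cooperative game by averaging each player's marginal contribution over all orderings. The feature names and payoff numbers are hypothetical and chosen only for illustration; they are not results from this study, and the SHAP package relies on much more efficient algorithms for tree models than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(value, players):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff of a set of features (made-up numbers)."""
    payoff = 0.0
    if "speed" in coalition:
        payoff += 0.30
    if "occupancy" in coalition:
        payoff += 0.10
    if "speed" in coalition and "occupancy" in coalition:
        payoff += 0.20  # interaction term, shared equally by the two features
    return payoff  # "weather" never changes the payoff in this toy game


phi = shapley_values(toy_value, ["speed", "occupancy", "weather"])
for name, contribution in phi.items():
    print(f"{name}: {contribution:.2f}")
```

In this toy game the contributions sum to the payoff of the full feature set, which is the additivity property that makes Shapley values attractive for attributing a prediction to individual features.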
The main objective of this study is both to assess the performance of XGBoost to detect the occurrence of accidents in real time and to analyze the importance of individual features for accident detection using SHAP. This work includes traffic, network, demographic, land use, and weather data sources. The data pre-processing procedure adopted in this article is similar to Parsa et al. (2019b)—from the authors of this article—where we used Synthetic Minority Oversampling Technique (SMOTE) to prepare the dataset, and support vector machine and probabilistic neural network for accident detection modeling. The knowledge gained from the study can help urban planners to evaluate (Kashani et al., 2019) and inform policy decisions.
Semantically, we note that the terms variable and feature are identical. The former tends to be used in statistics and the latter tends to be used in computer science. In this article, we will use feature in line with most works that use XGBoost and SHAP.
The rest of this article is organized as follows. First, the data sources used in this study and the feature extraction procedure are presented. Then, XGBoost and SHAP are reviewed in depth in the methodology section. Finally, the performance of the accident detection model is analyzed and discussed, and a comprehensive feature interpretation is provided through SHAP.
Section snippets
Data analysis and feature extraction
The accident data used in this study were collected and archived by the Illinois Department of Transportation (IDOT). The dataset includes 244 accident cases that occurred on the Chicago metropolitan expressways between December 2016 and December 2017. Fig. 1 displays the location of these accidents. In addition, 6073 non-accident cases are selected randomly from the same time period. Each accident/non-accident case occurred on a section that has a loop detector located at the beginning and end
Methodology
In this study, XGBoost is used to model accident detection. XGBoost is an efficient implementation of gradient boosted decision trees. A Decision Tree (DT) has a structure similar to a tree with a root node (topmost node), internal nodes, and leaf nodes (end nodes). DT algorithms generally use simple rules to start from the root node and branch out, going through internal nodes, to finally end up in the leaves. In contrast, gradient boosted decision tree is an ensemble learning technique that
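The general idea of boosting decision trees can be sketched in a few lines. The example below is a simplified illustration, not the XGBoost algorithm itself: it assumes a single feature, one-split trees (stumps), and squared-error loss, whereas XGBoost adds regularization and second-order gradient information. The speed values and labels are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree on a single feature.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean label
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # For squared loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


# Hypothetical data: low speeds labelled 1 (accident), high speeds labelled 0.
speeds = [20, 25, 30, 55, 60, 65]
labels = [1, 1, 1, 0, 0, 0]
model = gradient_boost(speeds, labels)
print(round(model(22), 3), round(model(62), 3))
```

Each new tree corrects the errors left by the trees before it, which is what distinguishes boosting from a single decision tree.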
XGBoost model
In this work, XGBoost is trained on 65 % of the data that is selected randomly, and the remaining 35 % is used to test the model. In addition, a 10-fold cross-validation process is employed on the training data to test the stability of the model performance—that is, the training data is divided into ten subsamples randomly, and ten models are trained in such a way that each time nine subsamples are used to train a model and one subsample is used to test a model.
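The partitioning described above can be sketched as follows. This is a minimal illustration of the 65/35 split and the ten folds only, using the case counts reported in the data section; the function names, the fixed seed, and the use of a plain (non-stratified) shuffle are assumptions, and model training is left out.

```python
import random


def train_test_split_indices(n_cases, train_fraction=0.65, seed=0):
    """Randomly assign case indices to a training set and a test set."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    n_train = int(n_cases * train_fraction)
    return indices[:n_train], indices[n_train:]


def k_fold(indices, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal subsamples
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


n_cases = 244 + 6073  # accident and non-accident cases
train_idx, test_idx = train_test_split_indices(n_cases)
splits = list(k_fold(train_idx))

print(len(train_idx), len(test_idx), len(splits))
```

Each of the ten splits trains on nine subsamples and validates on the remaining one, so every training case is used for validation exactly once.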
Fig. 2 displays the results of
Summary and conclusion
In this study, XGBoost is trained to model accident detection using a set of real-time data extracted and generated from different data sources. In total, 244 accident cases and 6073 non-accident cases are used to train the model that achieved an accuracy of 99 %, detection rate of 79 %, and false alarm rate of 0.16 %. Feature importance analysis is applied to the final model using SHAP, and traffic related features (especially speed) are found to have a substantial impact on the probability of
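The three performance measures quoted in this summary can be computed from a confusion matrix. The sketch below assumes the common definitions (detection rate as the share of accidents that are detected, false alarm rate as the share of non-accident cases flagged as accidents); the counts passed in are hypothetical and are not the results of this study.

```python
def detection_metrics(tp, fn, fp, tn):
    """Accuracy, detection rate, and false alarm rate from confusion-matrix counts.

    tp: accidents detected, fn: accidents missed,
    fp: non-accidents flagged as accidents, tn: non-accidents correctly ignored.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "detection_rate": tp / (tp + fn),
        "false_alarm_rate": fp / (fp + tn),
    }


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=8, fn=2, fp=1, tn=89)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With a dataset this imbalanced, accuracy alone is dominated by the non-accident cases, which is why the detection rate and false alarm rate are reported alongside it.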
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The research leading to these findings has received funding from the Illinois Department of Transportation (IDOT) and from the National Science Foundation (NSF) CAREER award #1551731. The authors would also like to thank IDOT for collecting and archiving the loop detector data used in this study.
References (64)
- et al. (2019). How instantaneous driving behavior contributes to crashes at intersections: extracting useful information from connected vehicle message data. Accid. Anal. Prev.
- et al. (2020). Severity analysis for large truck rollover crashes using a random parameter ordered logit model. Accid. Anal. Prev.
- et al. (2017). Examining associations between urban design attributes and transport mode choice for walking, cycling, public transport and private motor vehicle trips. J. Transp. Health
- et al. (2015). Support vector machine in crash prediction at the level of traffic analysis zones: assessing the spatial proximity effects. Accid. Anal. Prev.
- et al. (2016). Automatic accident detection with multi-modal alert system implementation for ITS. Veh. Commun.
- et al. (2018). Modeling travel mode and timing decisions: comparison of artificial neural networks and copula-based joint model. Travel Behav. Soc.
- et al. (2016). From Twitter to detector: real-time traffic incident detection using social media data. Transp. Res. Part C
- et al. (2019). An agent-based simulation model to evaluate the response to seismic retrofit promotion policies. Int. J. Disaster Risk Reduct.
- et al. (2016). Predicting crash risk and identifying crash precursors on Korean expressways using loop detector data. Accid. Anal. Prev.
- et al. (2019). Real-time accident detection: coping with imbalanced data. Accid. Anal. Prev.
- A comparison of statistical learning methods for deriving determining factors of accident occurrence from an imbalanced high resolution dataset. Accid. Anal. Prev.
- A comprehensive railroad-highway grade crossing consolidation model: a machine learning approach. Accid. Anal. Prev.
- A dynamic Bayesian network model for real-time crash prediction using traffic speed conditions data. Transp. Res. Part C
- Identification of freeway secondary accidents with traffic shock wave detected by loop detectors. Saf. Sci.
- Real-time estimation of secondary crash likelihood on freeways using high-resolution loop detector data. Transp. Res. Part C
- A deep learning approach for detecting traffic accidents from social media data. Transp. Res. Part C Emerg. Technol.
- A Machine Learning Distracted Driving Prediction Model. International Symposium of Intelligent Unmanned Systems on Artificial Intelligence
- Car accident detection and notification system using smartphone. Int. J. Comput. Sci. Mob. Comput.
- Accident detection and reporting system using GPS, GPRS and GSM technology. Int. Conf. Informatics, Electron. Vis
- The role of pre-crash driving instability in contributing to crash intensity using naturalistic driving data. Accid. Anal. Prev.
- Investigation of heterogeneity in severity analysis for large truck crashes. 98th Annu. Meet. Transp. Res. Board
- Why feature correlation matters…. A lot! Towar Data Sci.
- SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res.
- Learning deep representation from big and heterogeneous data for traffic accident inference. 30th AAAI Conf. Artif. Intell.
- XGBoost: a scalable tree boosting system. Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min.
- A vision based traffic accident detection method using extreme learning machine. Int. Conf. Adv. Robot. Mechatronics
- Chicago Metropolitan Agency for Planning’s 2013 Land Use Inventory for Northeastern Illinois, Version 1.0. Chicago Metrop. Agency Plan
- Greedy function approximation: a gradient boosting machine. Ann. Stat.
- An eXtreme gradient boosting method for identifying the factors contributing to crash/near-crash events: a naturalistic driving study. Can. J. Civ. Eng.
- Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Int. Conf. Intell. Comput.
- Comparing the behavior of oversampling and undersampling approach of class imbalance learning by combining class imbalance problem with noise. ICT Based Innov.