Toward safer highways, application of XGBoost and SHAP for real-time accident detection and feature analysis

https://doi.org/10.1016/j.aap.2019.105405

Highlights

  • Develop an XGBoost model to detect accidents with a detection rate of 79 % and an AUC of 89 %.

  • The developed model is robust and interpretable thanks to SHAP.

  • Complex interrelated impacts of selected features are captured and analyzed.

Abstract

Detecting traffic accidents as rapidly as possible is essential for traffic safety. In this study, we use eXtreme Gradient Boosting (XGBoost), a Machine Learning (ML) technique, to detect the occurrence of accidents using a set of real-time data comprising traffic, network, demographic, land use, and weather features. The data, from the Chicago metropolitan expressways, were collected between December 2016 and December 2017 and include 244 traffic accidents and 6073 non-accident cases. In addition, SHAP (SHapley Additive exPlanation) is employed to interpret the results and analyze the importance of individual features. The results show that XGBoost can detect accidents robustly with an accuracy, detection rate, and false alarm rate of 99 %, 79 %, and 0.16 %, respectively. Several traffic-related features, especially the difference in speed between 5 min before and 5 min after an accident, are found to have relatively more impact on the occurrence of accidents. Furthermore, a feature dependency analysis is conducted for three pairs of features. First, average daily traffic and speed after the accident/non-accident time at the upstream location are interpreted jointly. Then, distance to the Central Business District and residential density are analyzed. Finally, speed after the accident/non-accident time at the upstream location and speed after the accident/non-accident time at the downstream location are evaluated with respect to the model’s output.

Introduction

The occurrence of traffic accidents is a major concern in countries worldwide. With a rapid increase in the number of highways and motorized vehicles in most countries (Global status report on road safety, 2015), the total number of accidents has increased substantially worldwide. An annual report from the National Highway Traffic Safety Administration (NHTSA) stated that around 5,000,000 traffic accidents occur in the United States (US) each year (Traffic Safety Facts, 2013). In fact, traffic accidents have become the second leading cause of death for young people and the third leading cause of death for people between 30 and 44 in the US (Traffic safety facts, 2013). Moreover, traffic accidents resulting in death or injury increased by 3 % and 2 %, respectively, between 2011 and 2012 (Traffic safety facts, 2013). Globally, the World Health Organization reported that 1.25 million people die in traffic accidents every year (Global status report on road safety, 2015). Beyond the human health impacts, traffic accidents also generate negative impacts on traffic, especially on highways (Azimi et al., 2019), at intersections (Arvin et al., 2019a), and in work zones (Mokhtarimousavi et al., 2019), often resulting in increased congestion and emissions and the worsening of other factors if accidents are not detected properly and rapidly (Traffic Incident Management, 2013).

The transport community has been increasingly making use of novel computational techniques and new data sources (Sharifi et al., 2019; Golshani et al., 2018; Parsa et al., 2019a; Nasr Esfahani et al., 2019; Razi-Ardakani et al., 2018; Ahangari et al., 2019), which have also facilitated the prediction (Mansourkhaki et al., 2016, 2017), detection (Parsa et al., 2019b), and severity estimation (Azimi et al., 2020; Arvin et al., 2019b) of accidents. For example, smartphones are equipped with several sensors such as an accelerometer, a magnetometer, and a gyroscope that can provide data to detect accidents (Alwan et al., 2016; Fernandes et al., 2016). In fact, traffic information from smartphones has been integrated with other sources of data such as NetLogo simulated data (Thomas and Vidal, 2017) and airbag trigger data (Zaldivar et al., 2011) to detect accidents. These types of data are rarely available, however, and they tend to be expensive to acquire (White et al., 2011). Other potential data sources include visual data such as photos and videos. Several studies have utilized these data sources along with different techniques, including matrix approximation (Xia et al., 2015), a statistical heuristic method (Maaloul et al., 2017), a hybrid support vector machine with extended Kalman filter (Vishnu and Nedunchezhian, 2018), and an extreme learning machine (Chen et al., 2016b), to detect accidents. Vision-based accident detection requires a large amount of information provided by photos and videos, however, and therefore a large storage capacity (Chen et al., 2016b). In addition, the accuracy of vision-based accident detection models is significantly affected by weather conditions and picture/frame resolution (Xia et al., 2015). Social media is another new data source that can be used to detect accidents, for instance through crawling, processing, and filtering tweets (Gu et al., 2016). As an example, Deep Belief Network (DBN) and Long Short-Term Memory (LSTM) are two deep learning models that have been used to detect accidents from Twitter content (Zhang et al., 2018). In these models, it is preferable to fuse social media with traffic data to achieve higher performance (Zhang and He, 2016). In fact, social media data are mostly considered a supplement rather than a main source of data for accident detection since they tend to be less available than traffic data (Schulz et al., 2013). Moreover, the precision of the location of accidents reported in social media is generally poor (Zhang and He, 2016).

In the end, traditional traffic data offers a rich and relatively available source of data that can be used for accident detection (Amin and Jalil, 2012; Xu et al., 2016a). In particular, loop detector data is available for most highways and expressways in the US. Many techniques have been used to detect accidents using traffic data, such as conditional logistic regression (Kwak and Kho, 2016) and dynamic methods that detect the shock waves caused by an accident in order to identify secondary accidents (Wang et al., 2016). Moreover, many machine learning models have been used to detect accidents, including k-nearest neighbor (Ozbayoglu et al., 2017), regression tree (Ozbayoglu et al., 2017), feed-forward neural network (Ozbayoglu et al., 2017), support vector machine (Dong et al., 2015), probabilistic neural network (Parsa et al., 2019c), dynamic Bayesian network (Sun and Sun, 2015), and deep learning (Chen et al., 2016a; Parsa et al., 2019a). As a result, traffic data is often considered to be best suited to detect and predict the occurrence of accidents (Xu et al., 2016b).

eXtreme Gradient Boosting (XGBoost) is a relatively new method, proposed by Chen and Guestrin (2016), that generally achieves high accuracy and fast processing time while being computationally less costly and less complex (Chen and Guestrin, 2016; Hamilton et al., 2019). Meng (2018) has already leveraged XGBoost to predict the occurrence and the duration of accidents using several data sources including road geometric design, historical accident data, as well as traffic and weather data. Furthermore, two studies (Hamilton et al., 2019; Schlögl et al., 2019) showed that, in predicting the likelihood of an accident, XGBoost performed better than several other machine learning techniques, including Logistic Regression, Bayesian Regularized Neural Network, Pegasos SVM, Bagging Average Neural Networks, Deep Neural Network, and Gradient Boosting. Shan et al. (2018) also employed an artificial neural network to integrate multiple XGBoost models to predict accident duration. Finally, XGBoost has also been used to predict the severity of traffic accidents, and it achieved high performance, especially when using spatial data (Mokoatle et al., 2019).

In terms of model interpretation, which is especially important when using ML models that are often difficult to interpret, several studies have started to take advantage of SHAP (SHapley Additive exPlanation) (Ribeiro et al., 2016; Štrumbelj and Kononenko, 2014). SHAP builds on the Shapley value, which was proposed by Shapley in 1953 and is based on game theory (Shapley, 1953). It offers a powerful and insightful measure of the importance of a feature in a model. In 2017, Lundberg and Lee developed a practical package in Python that is able to calculate SHAP for different techniques including LightGBM, GBoost, CatBoost, XGBoost, and Scikit-learn tree models (Lundberg and Lee, 2017). Given these advantages, researchers have started to utilize this technique (Movahedi and Derrible, 2020). In traffic safety, Mihaita et al. (2019) employed SHAP to analyze the impact of different features on accident duration.
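To make the game-theoretic idea concrete, the following minimal sketch computes exact Shapley values for a small toy model by averaging each feature's marginal contribution over all feature orderings. The scoring function `toy_model`, the feature names, the instance `x`, and the baseline of zero for absent features are all illustrative assumptions and are not the model or data of this study. Practical SHAP implementations rely on much faster algorithms for tree ensembles.

```python
from itertools import permutations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = {f: 0.0 for f in features}
    for order in permutations(features):
        coalition = set()
        for f in order:
            before = value_fn(frozenset(coalition))
            coalition.add(f)
            after = value_fn(frozenset(coalition))
            phi[f] += after - before  # marginal contribution of f in this ordering
    n_orders = factorial(len(features))
    return {f: total / n_orders for f, total in phi.items()}

# Hypothetical instance to explain (names and values are made up).
x = {"speed_drop": 2.0, "adt": 1.0, "rain": 1.0}

def toy_model(present):
    """Toy score; features not in `present` are replaced by a baseline of 0."""
    v = {k: (x[k] if k in present else 0.0) for k in x}
    return 3.0 * v["speed_drop"] + 1.0 * v["adt"] + 2.0 * v["speed_drop"] * v["rain"]

phi = shapley_values(toy_model, list(x))
print(phi)
```

In this toy case the interaction term between `speed_drop` and `rain` is shared equally between the two features, and the attributions sum to the difference between the full prediction and the baseline prediction.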

The main objective of this study is both to assess the performance of XGBoost to detect the occurrence of accidents in real time and to analyze the importance of individual features for accident detection using SHAP. This work includes traffic, network, demographic, land use, and weather data sources. The data pre-processing procedure adopted in this article is similar to Parsa et al. (2019b)—from the authors of this article—where we used Synthetic Minority Oversampling Technique (SMOTE) to prepare the dataset, and support vector machine and probabilistic neural network for accident detection modeling. The knowledge gained from the study can help urban planners to evaluate (Kashani et al., 2019) and inform policy decisions.
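The oversampling idea behind SMOTE can be sketched as follows: a synthetic minority sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. This is a simplified illustration of the idea in Chawla et al. (2002), not the pre-processing code of Parsa et al. (2019b). The function name `smote`, the toy points, and the parameters `k` and `seed` are assumptions, and the base sample is drawn at random here rather than by iterating over every minority sample.

```python
import random
from math import dist

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of `base` among the other minority samples
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Toy two-feature minority class (hypothetical values).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=6, k=2)
print(new_points)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies within the range already spanned by the minority class.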

Semantically, we note that the terms variable and feature are synonymous. The former tends to be used in statistics and the latter tends to be used in computer science. In this article, we will use feature in line with most works that use XGBoost and SHAP.

The rest of this article is organized as follows. First, the data sources used in this study and the feature extraction procedure are presented. Then, XGBoost and SHAP are reviewed in depth in the methodology section. Finally, the performance of the accident detection model is analyzed and discussed, and a comprehensive feature interpretation is provided through SHAP.

Section snippets

Data analysis and feature extraction

The accident data used in this study were collected and archived by the Illinois Department of Transportation (IDOT). The dataset includes 244 accident cases that occurred on the Chicago metropolitan expressways between December 2016 and December 2017. Fig. 1 displays the location of these accidents. In addition, 6073 non-accident cases are selected randomly from the same time period. Each accident/non-accident case occurred on a section that has a loop detector located at the beginning and end

Methodology

In this study, XGBoost is used to model accident detection. XGBoost is an efficient implementation of gradient boosted decision trees. A Decision Tree (DT) has a structure similar to a tree with a root node (topmost node), internal nodes, and leaf nodes (end nodes). DT algorithms generally use simple rules to start from the root node and branch out, going through internal nodes, to finally end up in the leaves. In contrast, gradient boosted decision tree is an ensemble learning technique that

XGBoost model

In this work, XGBoost is trained on 65 % of the data that is selected randomly, and the remaining 35 % is used to test the model. In addition, a 10-fold cross-validation process is employed on the training data to test the stability of the model performance—that is, the training data is divided into ten subsamples randomly, and ten models are trained in such a way that each time nine subsamples are used to train a model and one subsample is used to test a model.
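The splitting scheme described above can be sketched on case indices alone. The snippet below is a minimal illustration that assumes a plain (non-stratified) random split, since the text does not say whether the split was stratified. The function name `split_and_folds`, the seed, and the rounding of the 65 % share are assumptions, and model fitting is deliberately left out.

```python
import random

def split_and_folds(n_cases, train_frac=0.65, n_folds=10, seed=42):
    """Random train/test split followed by k-fold partitions of the training set."""
    rng = random.Random(seed)
    indices = list(range(n_cases))
    rng.shuffle(indices)

    n_train = int(round(train_frac * n_cases))
    train, test = indices[:n_train], indices[n_train:]

    # Deal the shuffled training indices into n_folds nearly equal subsamples.
    folds = [train[i::n_folds] for i in range(n_folds)]

    cv_rounds = []
    for i in range(n_folds):
        # Fit on nine subsamples, validate on the remaining one.
        fit_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        cv_rounds.append((fit_idx, folds[i]))
    return train, test, cv_rounds

# 244 accident cases plus 6073 non-accident cases, as reported in the text.
train_idx, test_idx, cv_rounds = split_and_folds(244 + 6073)
print(len(train_idx), len(test_idx), len(cv_rounds))
```

In each of the ten rounds, a classifier would be fitted on `fit_idx` and scored on the held-out fold, and the spread of the ten scores indicates how stable the model performance is.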

Fig. 2 displays the results of

Summary and conclusion

In this study, XGBoost is trained to model accident detection using a set of real-time data extracted and generated from different data sources. In total, 244 accident cases and 6073 non-accident cases are used to train the model that achieved an accuracy of 99 %, detection rate of 79 %, and false alarm rate of 0.16 %. Feature importance analysis is applied to the final model using SHAP, and traffic-related features (especially speed) are found to have a substantial impact on the probability of

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The research leading to these findings has received funding from the Illinois Department of Transportation (IDOT) and from the National Science Foundation (NSF) CAREER award #1551731. The authors would also like to thank IDOT for collecting and archiving the loop detectors data used in this study.

References (64)

  • M. Schlögl et al.

    A comparison of statistical learning methods for deriving determining factors of accident occurrence from an imbalanced high resolution dataset

    Accid. Anal. Prev.

    (2019)
  • S. Soleimani et al.

    A comprehensive railroad-highway grade crossing consolidation model: a machine learning approach

    Accid. Anal. Prev.

    (2019)
  • Jie Sun et al.

    A dynamic Bayesian network model for real-time crash prediction using traffic speed conditions data

    Transp. Res. Part C

    (2015)
  • J. Wang et al.

    Identification of freeway secondary accidents with traffic shock wave detected by loop detectors

    Saf. Sci.

    (2016)
  • C. Xu et al.

    Real-time estimation of secondary crash likelihood on freeways using high-resolution loop detector data

    Transp. Res. Part C

    (2016)
  • Z. Zhang et al.

    A deep learning approach for detecting traffic accidents from social media data

    Transp. Res. Part C Emerg. Technol.

    (2018)
  • S. Ahangari et al.

    A Machine Learning Distracted Driving Prediction Model

    International Symposium of Intelligent Unmanned Systems on Artificial Intelligence

    (2019)
  • Z.S. Alwan et al.

    Car accident detection and notification system using smartphone

    Int. J. Comput. Sci. Mob. Comput.

    (2016)
  • S. Amin et al.

    Accident detection and reporting system using GPS, GPRS and GSM technology

    Int. Conf. Informatics, Electron. Vis

    (2012)
  • R. Arvin et al.

    The role of pre-crash driving instability in contributing to crash intensity using naturalistic driving data

    Accident Analysis & Prevention

    (2019)
  • G. Azimi et al.

    Investigation of heterogeneity in severity analysis for large truck crashes

    98th Annu. Meet. Transp. Res. Board

    (2019)
  • W. Badr

    Why feature correlation matters…. A lot!

    Towar. Data Sci.

    (2019)
  • N. Chawla et al.

    SMOTE: synthetic minority over-sampling technique

    J. Artif. Intell. Res.

    (2002)
  • Q. Chen et al.

    Learning deep representation from big and heterogeneous data for traffic accident inference

    30th AAAI Conf. Artif. Intell.

    (2016)
  • T. Chen et al.

    Xgboost: a scalable tree boosting system

    Proc. 22nd acm sigkdd Int. Conf. Knowl. Discov. Data Min.

    (2016)
  • Y. Chen et al.

    A vision based traffic accident detection method using extreme learning machine

    Int. Conf. Adv. Robot. Mechatronics

    (2016)
  • D. Clark

    Chicago Metropolitan Agency for Planning’s 2013 Land Use Inventory for Northeastern Illinois, Version 1.0. Chicago Metrop. Agency Plan

    (2016)
  • J.H. Friedman

    Greedy function approximation: a gradient boosting machine

    Ann. Stat.

    (2001)
  • Global status report on road safety 2015, 2015. World Heal....
  • B.A. Hamilton et al.

    An eXtreme gradient boosting method for identifying the factors contributing to crash/near-crash events: a naturalistic driving study

    Can. J. Civ. Eng.

    (2019)
  • H. Han et al.

    Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning

    Int. Conf. Intell. Comput.

    (2005)
  • P. Kaur et al.

    Comparing the behavior of oversampling and undersampling approach of class imbalance learning by combining class imbalance problem with noise

    ICT Based Innov.

    (2018)