Toward safer highways, application of XGBoost and SHAP for real-time accident detection and feature analysis

doi:10.1016/j.aap.2019.105405

Accident Analysis & Prevention

Volume 136, March 2020, 105405

https://doi.org/10.1016/j.aap.2019.105405

Highlights

• Develop an XGBoost model to detect accidents with detection rate of 79 %, and AUC of 89 %.
• The developed model is robust and interpretable thanks to SHAP.
• Complex interrelated impacts of selected features are captured and analyzed.

Abstract

Detecting traffic accidents as rapidly as possible is essential for traffic safety. In this study, we use eXtreme Gradient Boosting (XGBoost), a Machine Learning (ML) technique, to detect the occurrence of accidents using a set of real-time data comprising traffic, network, demographic, land use, and weather features. The data, from the Chicago metropolitan expressways, were collected between December 2016 and December 2017, and they include 244 traffic accidents and 6073 non-accident cases. In addition, SHAP (SHapley Additive exPlanation) is employed to interpret the results and analyze the importance of individual features. The results show that XGBoost can detect accidents robustly with an accuracy, detection rate, and false alarm rate of 99 %, 79 %, and 0.16 %, respectively. Several traffic-related features, especially the difference in speed between 5 min before and 5 min after an accident, are found to have relatively more impact on the occurrence of accidents. Furthermore, a feature dependency analysis is conducted for three pairs of features. First, average daily traffic and speed after accidents/non-accidents time at the upstream location are interpreted jointly. Then, distance to Central Business District and residential density are analyzed. Finally, speed after accidents/non-accidents time at upstream location and speed after accidents/non-accidents time at downstream location are evaluated with respect to the model’s output.
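The accuracy, detection rate, and false alarm rate quoted above can all be derived from a binary confusion matrix. The sketch below is a minimal illustration, not the authors' code. It assumes the common definitions (detection rate as the share of true accidents that are flagged, false alarm rate as the share of non-accident cases that are flagged). The function name `detection_metrics` and the counts passed to it are hypothetical, chosen only to be of a similar order of magnitude to the dataset.

```python
def detection_metrics(tp, fn, fp, tn):
    """Return (accuracy, detection rate, false alarm rate) from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    detection_rate = tp / (tp + fn)    # share of real accidents that are flagged
    false_alarm_rate = fp / (fp + tn)  # share of non-accident cases that are flagged
    return accuracy, detection_rate, false_alarm_rate


# Hypothetical counts, for illustration only (not taken from the paper).
acc, dr, far = detection_metrics(tp=193, fn=51, fp=10, tn=6063)
print(f"accuracy={acc:.4f}, detection rate={dr:.4f}, false alarm rate={far:.4f}")
```

Because non-accident cases vastly outnumber accidents, accuracy alone can look high even when many accidents are missed, which is why the detection rate and false alarm rate are reported alongside it.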

Introduction

The occurrence of traffic accidents is a major concern in countries worldwide. With a rapid increase in the number of highways and motorized vehicles in most countries (Global status report on road safety, 2015), the total number of accidents has increased substantially in the world, and an annual report from the National Highway Traffic Safety Administration (NHTSA) reported that around 5,000,000 traffic accidents occur in the United States (US) each year (Traffic Safety Facts, 2013). In fact, traffic accidents have become the second leading cause of death for young people and the third leading cause of death for people aged 30 to 44 in the US (Traffic Safety Facts, 2013). Moreover, traffic accidents resulting in death or injury increased by 3 % and 2 %, respectively, between 2011 and 2012 (Traffic Safety Facts, 2013). Worldwide, the World Health Organization reported that 1.25 million people die in traffic accidents every year (Global status report on road safety, 2015). Beyond the human health impacts, traffic accidents also generate negative impacts on traffic, especially on highways (Azimi et al., 2019), at intersections (Arvin et al., 2019a), and in work zones (Mokhtarimousavi et al., 2019), often resulting in increased congestion and emissions and the worsening of other factors if accidents are not detected properly and rapidly (Traffic Incident Management, 2013).

The transport community has been increasingly making use of novel computational techniques and new data sources (Sharifi et al., 2019; Golshani et al., 2018; Parsa et al., 2019a; Nasr Esfahani et al., 2019; Razi-Ardakani et al., 2018; Ahangari et al., 2019), which have also facilitated the prediction (Mansourkhaki et al., 2016, 2017), detection (Parsa et al., 2019b) and estimation of severity (Azimi et al., 2020; Arvin et al., 2019b) of accidents. For example, smartphones are equipped with several sensors such as an accelerometer, a magnetometer, and a gyroscope that can provide data to detect accidents (Alwan et al., 2016; Fernandes et al., 2016). In fact, traffic information from smartphones has been integrated with other sources of data such as NetLogo simulated data (Thomas and Vidal, 2017) and airbag trigger data (Zaldivar et al., 2011) to detect accidents. These types of data are rarely available, however, and they tend to be expensive to acquire (White et al., 2011). Other potential data sources include visual data such as photos and videos. Several studies have utilized these data sources along with different techniques, including matrix approximation (Xia et al., 2015), a statistical heuristic method (Maaloul et al., 2017), a hybrid support vector machine with extended Kalman filter (Vishnu and Nedunchezhian, 2018), and an extreme learning machine (Chen et al., 2016b), to detect accidents. Vision-based accident detection requires a large amount of information provided by photos and videos, however, and it requires large storage capacity (Chen et al., 2016b). In addition, the accuracy of vision-based accident detection models is significantly affected by weather conditions and picture/frame resolution (Xia et al., 2015). Social media is another new data source that can be used to detect accidents, for instance through crawling, processing, and filtering tweets (Gu et al., 2016). As an example, Deep Belief Network (DBN) and Long Short-Term Memory (LSTM) are two deep learning models that have been used to detect accidents from Twitter content (Zhang et al., 2018). In these models, it is preferable to fuse social media with traffic data to achieve higher performance (Zhang and He, 2016). In fact, social media data are mostly considered more as a supplement than as a main source of data for accident detection since they tend to be less available than traffic data (Schulz et al., 2013). Moreover, the precision of the location of accidents reported in social media is generally poor (Zhang and He, 2016).

In the end, traditional traffic data offers a rich and relatively available source of data that can be used for accident detection (Amin and Jalil, 2012; Xu et al., 2016a). In particular, loop detector data is available for most highways and expressways in the US. Many techniques have been used to detect accidents using traffic data, e.g., conditional logistic regression (Kwak and Kho, 2016), as well as dynamic methods that are able to detect shock waves caused by an accident to detect secondary accidents (Wang et al., 2016). Moreover, many machine learning models have been used to detect accidents, including k-nearest neighbor (Ozbayoglu et al., 2017), regression tree (Ozbayoglu et al., 2017), feed-forward neural network (Ozbayoglu et al., 2017), support vector machine (Dong et al., 2015), probabilistic neural network (Parsa et al., 2019c), dynamic Bayesian network (Sun and Sun, 2015), and deep learning (Chen et al., 2016a; Parsa et al., 2019a). For these reasons, traffic data is often considered to be best suited to detect and predict the occurrence of accidents (Xu et al., 2016b).

eXtreme Gradient Boosting (XGBoost) is a relatively new method, initially proposed by Chen and Guestrin in 2016 (Chen and Guestrin, 2016), that generally achieves high accuracy and fast processing times while being computationally less costly and less complex (Chen and Guestrin, 2016; Hamilton et al., 2019). Meng (2018) has already leveraged XGBoost to predict the occurrence and the duration of accidents using several data sources including road geometric design, historical accident data, as well as traffic and weather data. Furthermore, two studies (Hamilton et al., 2019; Schlögl et al., 2019) showed that XGBoost performed better at predicting the likelihood of an accident than several other machine learning techniques, including Logistic Regression, Bayesian Regularized Neural Network, Pegasos SVM, Bagging Average Neural Networks, Deep Neural Network, and Gradient Boosting. Shan et al. (2018) also employed an artificial neural network to integrate multiple XGBoost models together to predict accident duration. Finally, XGBoost has also been used to predict the severity of traffic accidents, and it achieved high performance, especially when using spatial data (Mokoatle et al., 2019).

In terms of model interpretation, which is especially important when using ML models that are often difficult to interpret, several studies have started to take advantage of SHAP (SHapley Additive exPlanation) (Ribeiro et al., 2016; Štrumbelj and Kononenko, 2014). SHAP builds on the Shapley value, which was initially proposed by Shapley in 1953 and is based on game theory (Shapley, 1953). It offers a powerful and insightful measure of the importance of a feature in a model. In 2017, Lundberg and Lee developed a practical package in Python that is able to calculate SHAP for different techniques including LightGBM, GBoost, CatBoost, XGBoost, and Scikit-learn tree models (Lundberg and Lee, 2017). Given these advantages, researchers have started to utilize this technique (Movahedi and Derrible, 2020). In traffic safety, Mihaita et al. (2019) employed SHAP to analyze the impact of different features on accident duration.
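To make the game-theoretic idea concrete, the sketch below computes exact Shapley values for a tiny toy model by enumerating every coalition of features. It is an illustration of the underlying definition only, not the algorithm used by the SHAP package, which relies on much faster model-specific methods. The toy model, the instance `x`, and the all-zero `baseline` are hypothetical, and replacing absent features by a fixed baseline is a simplifying assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (only feasible for small n)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = set(subset)
                gain = value_fn(coalition | {i}) - value_fn(coalition)
                phi[i] += weight * gain
    return phi


# Hypothetical toy model with an interaction between features 1 and 2.
def toy_model(x):
    return 2.0 * x[0] + x[1] * x[2]


x = [1.0, 2.0, 3.0]         # instance to explain (illustrative values)
baseline = [0.0, 0.0, 0.0]  # reference input (assumed)


def value_fn(coalition):
    # Features in the coalition keep their real value; the others use the baseline.
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return toy_model(mixed)


phi = shapley_values(value_fn, n_features=3)
print("contributions:", phi)
print("sum of contributions:", sum(phi))
print("model output minus baseline output:", toy_model(x) - toy_model(baseline))
```

The last two printed values should agree: the contributions add up to the difference between the model output for the instance and for the baseline, which is the additive property that gives SHAP its name.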

The main objective of this study is both to assess the performance of XGBoost to detect the occurrence of accidents in real time and to analyze the importance of individual features for accident detection using SHAP. This work includes traffic, network, demographic, land use, and weather data sources. The data pre-processing procedure adopted in this article is similar to Parsa et al. (2019b)—from the authors of this article—where we used Synthetic Minority Oversampling Technique (SMOTE) to prepare the dataset, and support vector machine and probabilistic neural network for accident detection modeling. The knowledge gained from the study can help urban planners to evaluate (Kashani et al., 2019) and inform policy decisions.

Semantically, we note that the terms variable and feature are identical. The former tends to be used in statistics and the latter tends to be used in computer science. In this article, we will use feature in line with most works that use XGBoost and SHAP.

The rest of this article is organized as follows. First, the data sources used in this study and the feature extraction procedure are presented. Then, XGBoost and SHAP are reviewed in depth in the methodology section. Finally, the performance of the accident detection model is analyzed and discussed, and a comprehensive feature interpretation is provided through SHAP.

Section snippets

Data analysis and feature extraction

The accident data used in this study were collected and archived by the Illinois Department of Transportation (IDOT). The dataset includes 244 accident cases that occurred on the Chicago metropolitan expressways between December 2016 and December 2017. Fig. 1 displays the location of these accidents. In addition, 6073 non-accident cases are selected randomly from the same time period. Each accident/non-accident case occurred on a section that has a loop detector located at the beginning and end

Methodology

In this study, XGBoost is used to model accident detection. XGBoost is an efficient implementation of gradient boosted decision trees. A Decision Tree (DT) has a structure similar to a tree with a root node (topmost node), internal nodes, and leaf nodes (end nodes). DT algorithms generally use simple rules to start from the root node and branch out, going through internal nodes, to finally end up in the leaves. In contrast, gradient boosted decision tree is an ensemble learning technique that

XGBoost model

In this work, XGBoost is trained on 65 % of the data that is selected randomly, and the remaining 35 % is used to test the model. In addition, a 10-fold cross-validation process is employed on the training data to test the stability of the model performance—that is, the training data is divided into ten subsamples randomly, and ten models are trained in such a way that each time nine subsamples are used to train a model and one subsample is used to test a model.
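The splitting procedure described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `split_and_folds` and the seed are hypothetical, and the excerpt does not say whether the split was stratified by class, so a plain random split is assumed. In practice, scikit-learn's `train_test_split` and `KFold` are commonly used for the same purpose.

```python
import random


def split_and_folds(n_cases, train_share=0.65, n_folds=10, seed=0):
    """Randomly split case indices into train/test, then cut the training part into folds."""
    rng = random.Random(seed)
    indices = list(range(n_cases))
    rng.shuffle(indices)
    n_train = round(train_share * n_cases)
    train, test = indices[:n_train], indices[n_train:]
    # Fold k takes every n_folds-th training index, so fold sizes differ by at most one.
    folds = [train[k::n_folds] for k in range(n_folds)]
    return train, test, folds


n_cases = 244 + 6073  # accident plus non-accident cases reported in the study
train, test, folds = split_and_folds(n_cases)

for k, held_out in enumerate(folds):
    held_out_set = set(held_out)
    fit_part = [i for i in train if i not in held_out_set]
    # A model would be trained on fit_part and evaluated on held_out here.
    print(f"fold {k}: fit on {len(fit_part)} cases, validate on {len(held_out)}")
```

The test indices are never touched during cross-validation, so the spread of scores across the ten folds reflects the stability of the model on the training data only.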

Fig. 2 displays the results of

Summary and conclusion

In this study, XGBoost is trained to model accident detection using a set of real-time data extracted and generated from different data sources. In total, 244 accident cases and 6073 non-accident cases are used to train the model, which achieved an accuracy of 99 %, detection rate of 79 %, and false alarm rate of 0.16 %. Feature importance analysis is applied to the final model using SHAP, and traffic-related features (especially speed) are found to have a substantial impact on the probability of

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The research leading to these findings has received funding from the Illinois Department of Transportation (IDOT) and from the National Science Foundation (NSF) CAREER award #1551731. The authors would also like to thank IDOT for collecting and archiving the loop detector data used in this study.

References (64)

R. Arvin et al. How instantaneous driving behavior contributes to crashes at intersections: extracting useful information from connected vehicle message data. Accid. Anal. Prev. (2019)
G. Azimi et al. Severity analysis for large truck rollover crashes using a random parameter ordered logit model. Accid. Anal. Prev. (2020)
C. Boulange et al. Examining associations between urban design attributes and transport mode choice for walking, cycling, public transport and private motor vehicle trips. J. Transp. Health (2017)
N. Dong et al. Support vector machine in crash prediction at the level of traffic analysis zones: assessing the spatial proximity effects. Accid. Anal. Prev. (2015)
B. Fernandes et al. Automatic accident detection with multi-modal alert system implementation for ITS. Veh. Commun. (2016)
N. Golshani et al. Modeling travel mode and timing decisions: comparison of artificial neural networks and copula-based joint model. Travel Behav. Soc. (2018)
Y. Gu et al. From Twitter to detector: real-time traffic incident detection using social media data. Transp. Res. Part C (2016)
H. Kashani et al. An agent-based simulation model to evaluate the response to seismic retrofit promotion policies. International Journal of Disaster Risk Reduction (2019)
H. Kwak et al. Predicting crash risk and identifying crash precursors on Korean expressways using loop detector data. Accid. Anal. Prev. (2016)
A.B. Parsa et al. Real-time accident detection: coping with imbalanced data. Accid. Anal. Prev. (2019)
Y. Chen et al. A vision based traffic accident detection method using extreme learning machine. Int. Conf. Adv. Robot. Mechatronics (2016)
D. Clark. Chicago Metropolitan Agency for Planning’s 2013 Land Use Inventory for Northeastern Illinois, Version 1.0. Chicago Metrop. Agency Plan (2016)
J.H. Friedman. Greedy function approximation: a gradient boosting machine. Ann. Stat. (2001)
Global status report on road safety 2015, 2015. World Heal....
B.A. Hamilton et al. An eXtreme gradient boosting method for identifying the factors contributing to crash/near-crash events: a naturalistic driving study. Can. J. Civ. Eng. (2019)
H. Han et al. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Int. Conf. Intell. Comput. (2005)
P. Kaur et al. Comparing the behavior of oversampling and undersampling approach of class imbalance learning by combining class imbalance problem with noise. ICT Based Innov. (2018)



A Machine Learning Distracted Driving Prediction Model. International Symposium of Intelligent Unmanned Systems on Artificial Intelligence
Car accident detection and notification system using smartphone. Int. J. Comput. Sci. Mob. Comput.
Accident detection and reporting system using GPS, GPRS and GSM technology. Int. Conf. Informatics, Electron. Vis
The role of pre-crash driving instability in contributing to crash intensity using naturalistic driving data. Accident Analysis & Prevention
Investigation of heterogeneity in severity analysis for large truck crashes. 98th Annu. Meet. Transp. Res. Board
Why feature correlation matters…. A lot! Towar Data Sci.
SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res.
Learning deep representation from big and heterogeneous data for traffic accident inference. 30th AAAI Conf. Artif. Intell.
XGBoost: a scalable tree boosting system. Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min.
