Penalized Conway-Maxwell-Poisson regression for modelling dispersed discrete data: The case study of motor vehicle crash frequency
Introduction
Statistical models for road crash data have become extremely important for all aspects of transport and traffic systems (TTS). In particular, these models help establish an evidence-based framework through which transportation authorities can develop innovative road safety strategies and interventions. Over the past years, there has been growing interest in the literature in the development of statistical models for road safety studies (Lord and Persaud, 2000, Lord et al., 2008, Lord and Mannering, 2010, Kokonendji, 2014, Frank and Samia, 2015, Paraskevi et al., 2015, Abdella et al., 2016, Abdur Rouf et al., 2018).
Poisson-based models are the most popular in road safety management and analysis. The motor vehicle crash frequency (MVC-F) is probably the most common crash count variable (Nazha et al., 2018). Since the MVC-F is usually characterized by over- or under-dispersion, the traditional Poisson-based models become inappropriate. Over-dispersion occurs when the variance of the crash count exceeds its expected mean, i.e. Var(Y) > E(Y). Conversely, under-dispersion occurs when the variance is less than the expected mean, i.e. Var(Y) < E(Y). Over-dispersion in the MVC-F count usually occurs when the dataset on some road segments or time intervals exhibits a large number of zero values (Vangala et al., 2015).
Several significant contributions have been made in recent years on the application of the Conway-Maxwell-Poisson (COM-Poisson) distribution to modelling over- and under-dispersed count variables. Guikema and Coffelt (2008) developed a COM-Poisson generalized linear model (GLM) formulation that uses two different link functions (a dual-link formulation) for modelling discrete counts: one for the location parameter and one for the shape parameter, each with its own set of covariates. Lord et al. (2008) modified the dual-link formulation by replacing the second link function with the value of the shape parameter. The authors used a single link function to relate the response variable to the model predictors and suggested the Bayesian method for estimating the coefficients of the resulting model. The choice of Bayesian estimation is mainly attributed to its favourable mathematical and practical properties. The study showed that the COM-Poisson GLMs perform comparably to the negative binomial (NB) models. One advantage of the COM-Poisson GLM over the NB model is that it can handle both over- and under-dispersed count data (Lord et al., 2008). This feature extends the applicability of COM-Poisson-based models to a wide range of industrial and service sectors.
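The over- and under-dispersion conditions described above can be checked on a sample of counts with the variance-to-mean ratio. The following is a minimal sketch, not the authors' procedure: the function names, the tolerance `tol`, and the two small count samples are hypothetical and chosen only for illustration.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance-to-mean ratio of a list of counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("the ratio is undefined when the sample mean is zero")
    return variance(counts) / m


def classify_dispersion(counts, tol=0.05):
    """Label a sample by comparing its variance-to-mean ratio with 1.

    tol is an arbitrary illustrative tolerance, not a formal test.
    """
    ratio = dispersion_index(counts)
    if ratio > 1 + tol:
        return "over-dispersed"
    if ratio < 1 - tol:
        return "under-dispersed"
    return "approximately equi-dispersed"


# Hypothetical samples: many zeros with a few large counts, and tightly clustered counts.
zero_heavy = [0, 0, 0, 1, 0, 7, 0, 12, 0, 2]
clustered = [3, 4, 3, 4, 3, 4, 3, 4]

print(classify_dispersion(zero_heavy))
print(classify_dispersion(clustered))
```

A ratio well above 1 suggests that a plain Poisson model understates the variability, while a ratio well below 1 suggests the opposite; a formal dispersion test would be needed to confirm either.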
Sellers and Shmueli (2010) considered the log-linear form of the Poisson regression by McCullagh and Nelder (1997) and proposed a COM-Poisson GLM formulation using a logarithmic link function. This function is often preferred over others because it leads to a direct estimation of the coefficients of the GLM regressors. Both of the above studies assume a fixed value of the shape parameter, i.e. the shape parameter is independent of the covariates of the model. Sellers and Andrew (2016) developed another form of the COM-Poisson, named the zero-inflated COM-Poisson (ZICMP), which is capable of modelling both over- and under-dispersed datasets. The ZICMP has shown a performance relatively close to that of the zero-inflated negative binomial (ZINB) with respect to the Akaike Information Criterion (AIC).
In the practice of road crash modeling, many situations may give rise to multicollinearity, or simply collinearity, between contributory crash factors. In statistics, collinearity refers to the existence of a significant linear relationship between the model factors. Collinearity has a severe impact on the reliability of the GLM estimates as well as on the interpretation of the results. Despite the attractive prediction performance of the COM-Poisson-based regression models, we believe that there is still a need for COM-Poisson regression models that not only accommodate the dispersion issue but are also insensitive to potential collinearity between road crash factors.
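One common way to quantify collinearity between two predictors is the variance inflation factor (VIF), which for a pair of predictors reduces to 1 / (1 - r²), where r is their Pearson correlation. The sketch below is illustrative only: the source does not state which diagnostic the authors used, and the predictor values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2).

    Undefined (division by zero) when the predictors are perfectly correlated.
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictors: x2 is nearly a multiple of x1, x3 is uncorrelated with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
x3 = [1.0, -1.0, 0.0, -1.0, 1.0]

print(vif_two_predictors(x1, x2))  # large value: strong collinearity
print(vif_two_predictors(x1, x3))  # close to 1: no collinearity
```

A VIF far above 1 indicates that the standard errors of the corresponding coefficient estimates are inflated, which is the instability that penalization is meant to reduce.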
This work develops a penalized likelihood formulation for the COM-Poisson GLM regression to stabilize the estimates of the regression parameters. We develop the penalized likelihood function by adding a ridge penalty, i.e. the squared norm of the GLM coefficients, to the likelihood function of Sellers and Shmueli (2010). Considering the effectiveness of the COM-Poisson distribution in modelling dispersed count data, our penalized COM-Poisson GLM regression provides practitioners with a powerful tool for simultaneously overcoming both the “dispersion” and “collinearity” problems. To achieve a better predictive performance, we implement the penalized COM-Poisson GLM regression under a K-fold cross-validation framework.
We organize the rest of this research paper as follows: Section 2 introduces the COM-Poisson distribution. Section 3 describes the penalized COM-Poisson GLM regression model. Section 4 describes the cross-validation approach. Section 5 illustrates the implementation of the penalized COM-Poisson GLM for modeling a real-world example. Section 6 presents a summary of the findings and outlines extensions for future research work.
Section snippets
The COM-Poisson distribution
The COM-Poisson is usually considered a general form of the Poisson distribution. Conway and Maxwell (1962) first developed the COM-Poisson for analyzing queuing systems. Its statistical properties were later studied and reported by Shmueli et al. (2005). The probability mass function (pmf) of the COM-Poisson distribution is as follows: where is the th discrete response value, is the mean value of the observations (
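As a minimal numerical sketch, the COM-Poisson pmf can be evaluated under the standard parameterization reported by Shmueli et al. (2005), P(Y = y) = λ^y / ((y!)^ν Z(λ, ν)) with Z(λ, ν) = Σ_j λ^j / (j!)^ν. The symbols λ (rate) and ν (shape), the function names, and the series truncation rule are assumptions made here for illustration; this paper's own notation may differ.

```python
import math


def com_poisson_log_z(lam, nu, tol=1e-12, max_terms=10000):
    """Log of Z(lam, nu) = sum_j lam^j / (j!)^nu, using a truncated series.

    Terms are accumulated in log space; the loop stops once a term past the
    approximate mode lam**(1/nu) falls below tol times the largest term.
    """
    if lam <= 0 or nu <= 0:
        raise ValueError("this sketch assumes lam > 0 and nu > 0")
    log_lam = math.log(lam)
    mode = lam ** (1.0 / nu)
    log_terms = []
    largest = -math.inf
    for j in range(max_terms):
        log_term = j * log_lam - nu * math.lgamma(j + 1)
        log_terms.append(log_term)
        largest = max(largest, log_term)
        if j > mode and log_term < largest + math.log(tol):
            break
    return largest + math.log(sum(math.exp(t - largest) for t in log_terms))


def com_poisson_pmf(y, lam, nu):
    """P(Y = y) for the COM-Poisson distribution with rate lam and shape nu."""
    log_p = y * math.log(lam) - nu * math.lgamma(y + 1) - com_poisson_log_z(lam, nu)
    return math.exp(log_p)


def mean_and_variance(lam, nu, upper=200):
    """Approximate mean and variance by summing the pmf over 0..upper-1."""
    probs = [com_poisson_pmf(y, lam, nu) for y in range(upper)]
    m = sum(y * p for y, p in enumerate(probs))
    v = sum((y - m) ** 2 * p for y, p in enumerate(probs))
    return m, v


for shape in (0.5, 1.0, 2.0):
    m, v = mean_and_variance(2.0, shape)
    print(f"nu={shape}: mean={m:.3f}, variance={v:.3f}")
```

With ν = 1 the pmf reduces to the Poisson pmf; ν < 1 gives a variance larger than the mean (over-dispersion) and ν > 1 gives a variance smaller than the mean (under-dispersion).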
Methodology: the penalized COM-Poisson GLM
This section introduces the COM-Poisson GLM reported by Sellers and Shmueli (2010) and shows the steps followed in this work to develop its penalized version. The GLM using is denoted as follows: where is the observation of the model predictor. The values are the coefficients of the model predictor in the GLM. There are two well-known techniques for estimating the coefficients of the GLM shown in Eq. (6). These are the weighted
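A ridge-penalized COM-Poisson log-likelihood of the kind described in this section can be sketched as follows. The sketch assumes a log link, log(λ_i) = x_i'β, a constant shape parameter ν, and a penalty of the form penalty × Σ β_j² that leaves the intercept unpenalized. These choices, the names `log_z` and `penalized_loglik`, and the small dataset are assumptions for illustration and are not taken from the paper's equations.

```python
import math


def log_z(lam, nu, max_terms=500):
    """Log of Z(lam, nu) = sum_j lam^j / (j!)^nu, truncated after max_terms terms."""
    log_terms = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(max_terms)]
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def penalized_loglik(beta, nu, X, y, penalty):
    """COM-Poisson log-likelihood minus a ridge penalty on the slopes.

    beta    : coefficients, beta[0] being the intercept
    nu      : shape parameter, assumed constant across observations
    X       : rows of predictors, each row starting with 1.0 for the intercept
    y       : observed counts
    penalty : non-negative ridge penalization parameter (assumed name)
    """
    loglik = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))  # linear predictor = log(lambda_i)
        lam = math.exp(eta)
        loglik += y_i * eta - nu * math.lgamma(y_i + 1) - log_z(lam, nu)
    ridge = penalty * sum(b * b for b in beta[1:])  # intercept left unpenalized
    return loglik - ridge


# Hypothetical data: three observations, one predictor plus an intercept column.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, -0.3]]
y = [2, 5, 1]
beta = [0.4, 0.6]

print(penalized_loglik(beta, nu=1.0, X=X, y=y, penalty=0.0))
print(penalized_loglik(beta, nu=1.0, X=X, y=y, penalty=2.0))
```

Maximizing this function over β (and ν) for a fixed penalty would give the penalized estimates; a larger penalty shrinks the slopes toward zero, trading some bias for lower variance under collinearity.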
Penalized likelihood estimation based on the cross-validation
The penalization may introduce some bias. A proper selection of the penalization parameter limits the negative effect of this bias and results in a better mean squared error. Hoerl and Kennard (1970) first proposed this parameter when ridge regression was introduced as a solution for the collinearity problem. Since then, several methods have been developed to find the most efficient estimate of this parameter (Kibria, 2003, Khalaf and Shukur, 2005, Alkhamisi et al., 2006, Alkhamisi
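Selecting the penalization parameter by K-fold cross-validation can be sketched generically as below. The `fit` and `loss` callables are placeholders: a toy shrunken-mean estimator stands in for the penalized COM-Poisson fit so that the example stays self-contained. The candidate penalties, the interleaved fold assignment, and the count data are all hypothetical.

```python
def k_fold_indices(n, K):
    """Split indices 0..n-1 into K interleaved folds (deterministic, no shuffling)."""
    return [list(range(f, n, K)) for f in range(K)]


def select_penalty(data, candidates, K, fit, loss):
    """Return the candidate penalty with the lowest K-fold CV error, plus all errors.

    fit(train, penalty) -> model
    loss(model, test)   -> summed prediction error on the held-out fold
    """
    folds = k_fold_indices(len(data), K)
    cv_error = {}
    for penalty in candidates:
        total = 0.0
        for held_out in folds:
            held = set(held_out)
            train = [data[i] for i in range(len(data)) if i not in held]
            test = [data[i] for i in held_out]
            total += loss(fit(train, penalty), test)
        cv_error[penalty] = total / len(data)
    best = min(candidates, key=lambda p: cv_error[p])
    return best, cv_error


# Toy stand-in for a penalized fit: a mean shrunk toward zero by the penalty.
def fit_shrunken_mean(train, penalty):
    return sum(train) / (len(train) + penalty)


def squared_error(model, test):
    return sum((value - model) ** 2 for value in test)


counts = [2, 0, 3, 1, 4, 0, 2, 5, 1, 3, 0, 2]  # hypothetical crash counts
best, errors = select_penalty(counts, [0.0, 0.5, 1.0, 5.0], K=4,
                              fit=fit_shrunken_mean, loss=squared_error)
print(best, errors)
```

In an actual application, `fit` would maximize the penalized COM-Poisson likelihood on the training folds and `loss` would measure prediction error on the held-out fold.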
An illustrative example: the MVC-F in Toronto, Ontario
In this section, we compare the prediction accuracy of the penalized COM-Poisson GLM with both the Poisson and the classical COM-Poisson GLM. For a fair comparison, this paper uses the overdispersed crash dataset collected in 1995 at 868 signalized intersections located in Toronto, Ontario. Several researchers have used this dataset (Lord, 2000, Miaou and Lord, 2003, Miranda-Moreno and Fu, 2007, Lord et al., 2008, Sellers and Shmueli, 2010). For a further detailed description of this dataset,
Optimizing the K-parameter
The selection of K is critical to the quality of the cross-validation. In Section 5, we used K = 4. The results showed the advantage of the penalized COM-Poisson over its counterparts, the classical COM-Poisson and the Poisson regression. In this section, we conduct an analytical study to investigate the performance of the penalized COM-Poisson under several values of K.
Three different values of K, namely {2, 6, 8}, were considered. Table 7 shows the descriptive analysis of the sub-dataset under each
Conclusion
This paper integrates penalized likelihood estimation, more specifically the ridge penalty function, with the COM-Poisson GLM regression to enhance its prediction accuracy in the presence of collinearity. Using penalized likelihood estimation makes the COM-Poisson GLM regression more robust to inflation in the standard errors of the GLM estimates. The real-world example illustrated the excellent performance of the penalized COM-Poisson GLM regression in simultaneously
Acknowledgment
The authors are grateful to Dominique Lord (Texas A&M University) and Srinivas Geedipally (Texas A&M Transportation Institute) for providing the dataset used in the real-world example of Toronto, Ontario.
References (32)
- et al. The statistical analysis of crash-frequency data: a review and assessment of methodological alternatives. Transp. Res. Part A (2010)
- et al. Application of the Conway-Maxwell-Poisson generalized linear model for analyzing motor vehicle crashes. Accid. Anal. Prev. (2008)
- et al. A flexible zero-inflated model to address data dispersion. Comput. Stat. Data Anal. (2016)
- et al. Exploring the application of the Negative Binomial-Generalized Exponential model for analyzing traffic crash data with excess zeros. Anal. Meth. Accid. Res. (2015)
- et al. Usage of non-linear regression for modeling the behavior of motor vehicle crash fatality (MVF) rate.
- et al. Modelling trends in road crash frequency in Qatar State. Int. J. Operat. Res. (2019)
- et al. Ridge penalization-based generalized linear model (GzLM) for predicting risky-driving index.
- et al. Developing ridge parameters for SUR model. Commun. Stat. Theory Meth. (2008)
- et al. Some modifications for choosing ridge parameter. Commun. Stat. Theory Meth. (2006)
- et al. Penalized maximum likelihood estimator for normal mixtures. Board Found. Scand. J. Stat. (2003)
- A queuing model with state-dependent service rates. J. Ind. Eng.
- Determinants of seat belt use: a regression analysis with FARS data corrected for self-selection. J. Saf. Res.
- Analyzing the impact of human characteristics on the comprehensibility of road traffic signs. The Proceedings of the International Conference on Industrial Engineering and Operations Management, Bandung, Indonesia, March 6-8, 2018
- Application of penalized regression techniques in modelling insulin sensitivity by correlated metabolic parameters. PLoS ONE
- L1 penalized estimation in the Cox proportional hazards model. Biometr. J.
- A flexible count data regression model for risk analysis. Risk Anal.