Elsevier

Safety Science

Volume 120, December 2019, Pages 157-163

Penalized Conway-Maxwell-Poisson regression for modelling dispersed discrete data: The case study of motor vehicle crash frequency

https://doi.org/10.1016/j.ssci.2019.06.036

Highlights

  • This original manuscript highlights one of the critical issues in the modeling of road safety measures.

  • The manuscript introduces an improved version of the COM-Poisson regression model for road safety modeling.

  • The manuscript presents the performance of the proposed regression model using a real-world example dataset.

Abstract

Statistical modelling of road crashes has been of great interest to researchers over the last decades. Such models are necessary for investigating opportunities for road safety improvement. The motor vehicle crash frequency (MVC-F) is probably the most important count of road crashes. In practice, like many other discrete variables, this count often exhibits over- or underdispersion, i.e. the variance is greater or less than the mean. The traditional regression models, especially those based on the Poisson distribution, are inefficient in modelling dispersed count data. In contrast, the Conway-Maxwell-Poisson (COM-Poisson) distribution has proven powerful in modelling count data with a wide range of dispersion. In crash data modelling, many situations may give rise to collinearity between contributory crash factors. In this situation, the maximum likelihood estimates of the coefficients of the COM-Poisson GLM become increasingly unreliable as the collinearity among the model predictors increases. This paper addresses this issue and proposes a penalized likelihood scheme to be used with the COM-Poisson GLM regression to improve its prediction performance. For better GLM regression output, we suggest implementing the penalized COM-Poisson GLM regression under a K-fold cross-validation framework. A real-world crash example is provided, showing the performance of the penalized COM-Poisson GLM regression compared to the Poisson and the classical COM-Poisson GLM regressions.

Introduction

Statistical models for road crash data have become extremely important for all aspects of transport and traffic systems (TTS). These models, in particular, help establish an evidence-based framework through which transportation authorities can develop innovative road safety strategies and interventions. Over the past years, there has been growing interest in the literature in the development of statistical models for road safety studies (Lord and Persaud, 2000, Lord et al., 2008, Lord and Mannering, 2010, Kokonendji, 2014, Frank and Samia, 2015, Paraskevi et al., 2015, Abdella et al., 2016, Abdur Rouf et al., 2018).

Poisson-based models are the most popular in road safety management and analysis. The motor vehicle crash frequency (MVC-F) is probably the most common crash count variable (Nazha et al., 2018). Since the MVC-F is usually characterized by over- or underdispersion, the traditional Poisson-based models become inappropriate. Over-dispersion occurs when the variance of the crash count exceeds its expected value, i.e. $\mathrm{Var}(Y) > E(Y)$. Conversely, under-dispersion occurs when the variance is less than the expected value, i.e. $\mathrm{Var}(Y) < E(Y)$. Over-dispersion in the MVC-F count usually occurs when the dataset for some road segments or time intervals exhibits a large number of zero values (Vangala et al., 2015). Several significant contributions have been made in recent years on applying the Conway-Maxwell-Poisson (COM-Poisson) distribution to over- and underdispersed count variables. Guikema and Coffelt (2008) developed a COM-Poisson GLM formulation using two different link functions (dual-link) for modeling discrete counts. These two functions are $\ln\mu = \beta_0 + \sum_{i=1}^{p} \beta_i x_i$ and $\ln\vartheta = \varphi_0 + \sum_{j=1}^{q} \varphi_j z_j$, where p and q are the numbers of covariates used in the location and the shape link functions, respectively. Lord et al. (2008) modified the dual-link formulation by replacing the second link function with the value of the shape parameter $\vartheta$. The authors used $\log(\gamma^{1/\vartheta})$ to link the response variable with the model predictors and suggested a Bayesian method for estimating the coefficients of the resulting model. The choice of Bayesian estimation here is mainly attributed to its favourable mathematical and practical properties. The study showed that the COM-Poisson GLMs perform comparably to negative binomial (NB) models. One advantage of the COM-Poisson GLM over the NB model is that it can handle both over- and underdispersed count data (Lord et al., 2008). This feature expands the domain of usage of COM-Poisson-based models to a wide range of industrial and service sectors. Sellers and Shmueli (2010) considered the log-linear form of the Poisson regression by McCullagh and Nelder (1997) and proposed a COM-Poisson GLM formulation using $\eta(E[Y]) = \log(\gamma)$ as the link function. This function is often preferred over others because it leads to a direct estimation of the coefficients of the GLM regressors. Both of the above studies are mainly based on the assumption of a fixed value of the shape parameter, i.e. the shape parameter is independent of the covariates of the model. Sellers and Andrew (2016) developed another form of the COM-Poisson, named the zero-inflated COM-Poisson (ZICMP), which is capable of modelling both over- and under-dispersed datasets. The ZICMP showed a performance relatively close to that of the zero-inflated negative binomial (ZINB) with respect to the Akaike Information Criterion (AIC).
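The variance-to-mean comparison that defines over- and under-dispersion can be checked directly on a sample of counts. The following is a minimal sketch in Python; the function name `dispersion_index` and the two small count vectors are hypothetical and serve only as illustration, they are not taken from the paper's dataset.

```python
import statistics

def dispersion_index(counts):
    """Sample variance divided by sample mean of a list of counts.

    A value above 1 suggests over-dispersion, a value below 1 suggests
    under-dispersion, and a value near 1 is consistent with a Poisson model.
    """
    return statistics.variance(counts) / statistics.mean(counts)

# Hypothetical crash counts per intersection (illustrative values only).
many_zeros = [0, 0, 0, 1, 0, 7, 0, 12, 0, 2]   # many zeros plus a few large counts
very_regular = [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]  # counts that barely vary

print(dispersion_index(many_zeros) > 1)    # over-dispersed sample
print(dispersion_index(very_regular) < 1)  # under-dispersed sample
```

This raw ratio ignores covariates, so it is only a first diagnostic; dispersion in a regression setting is assessed conditionally on the predictors.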

In the practice of road crash modeling, many situations may give rise to multicollinearity (or simply collinearity) between contributory crash factors. In statistics, collinearity refers to the existence of a strong linear relationship between the model factors. Collinearity has a severe impact on the reliability of the GLM estimates as well as on the interpretation of the results. Despite the attractive prediction performance of the COM-Poisson-based regression models, we believe that there is still a need for COM-Poisson regression models that not only accommodate the dispersion issue but are also insensitive to potential collinearity between road crash factors.
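As a small illustration of how collinearity can be quantified, the sketch below computes the variance inflation factor (VIF) for the special case of two predictors, where it equals 1/(1 − r²) with r the Pearson correlation between them. The function names and the traffic-flow values are hypothetical, and the VIF is a standard diagnostic rather than something this preview of the paper states it uses.

```python
import math

def pearson_r(x, z):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz / math.sqrt(sxx * szz)

def vif_two_predictors(x, z):
    """VIF of either predictor in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x, z)
    return 1.0 / (1.0 - r * r)

# Hypothetical major-road and minor-road traffic flows (illustrative values only).
major_flow = [10, 12, 15, 18, 20, 25]
minor_flow = [9, 13, 14, 19, 21, 24]
print(vif_two_predictors(major_flow, minor_flow))
```

A VIF close to 1 indicates nearly uncorrelated predictors, while large values indicate that the standard errors of the corresponding coefficient estimates are strongly inflated.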

This work develops a penalized likelihood formulation to be used in the COM-Poisson GLM regression to stabilize the estimates of the regression parameters. We develop the penalized likelihood function by adding an L2 penalty, i.e. the squared norm of the GLM coefficients, to the likelihood function of Sellers and Shmueli (2010). Considering the effectiveness of the COM-Poisson distribution in modelling dispersed count data, our penalized COM-Poisson GLM regression provides practitioners with a powerful tool for simultaneously overcoming both the "dispersion" and "collinearity" problems. To achieve a better predictive performance of our model, we implement the penalized COM-Poisson GLM regression under a K-fold cross-validation framework.

We organize the rest of this research paper as follows: Section 2 introduces the COM-Poisson distribution. Section 3 describes the penalized COM-Poisson GLM regression model. Section 4 describes the cross-validation approach. Section 5 illustrates the implementation of the penalized COM-Poisson GLM for modeling a real-world example. Section 6 presents a summary of the findings and outlines extensions for future research work.

Section snippets

The COM-Poisson distribution

The COM-Poisson is usually considered a general form of the Poisson distribution. Conway and Maxwell (1962) first developed the COM-Poisson for analyzing queuing systems. Later, its statistical properties were studied and reported by Shmueli et al. (2005). The probability mass function (pmf) of the COM-Poisson distribution is as follows:
$$\Pr(Y_i = y_i \mid \gamma_i, \vartheta) = f(y_i; \gamma_i, \vartheta) = \frac{\gamma_i^{y_i}}{Q(\gamma_i, \vartheta)\,(y_i!)^{\vartheta}}; \quad y_i = 0, 1, \ldots; \quad i = 1, 2, \ldots, n$$
where $y_i \in \mathbb{R}^n$ is the ith discrete response value, $\gamma_i$ is the mean value of the observations ($\gamma_i > 0$)
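To make the pmf concrete, here is a minimal Python sketch that evaluates it numerically. It assumes the standard definition of the normalizing constant, Q(γ, ϑ) = Σ_{s≥0} γ^s / (s!)^ϑ, and truncates the infinite series at a fixed number of terms; the function names and the `max_terms` cutoff are assumptions of this sketch, not part of the paper.

```python
import math

def log_q(gamma, nu, max_terms=200):
    """Log of the truncated series Q(gamma, nu) = sum_s gamma^s / (s!)^nu.

    Terms are accumulated on the log scale to avoid overflow.
    """
    log_terms = [s * math.log(gamma) - nu * math.lgamma(s + 1)
                 for s in range(max_terms)]
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def com_poisson_pmf(y, gamma, nu, max_terms=200):
    """COM-Poisson probability of count y for rate gamma > 0 and shape nu."""
    log_p = (y * math.log(gamma) - nu * math.lgamma(y + 1)
             - log_q(gamma, nu, max_terms))
    return math.exp(log_p)

# With nu = 1 the pmf reduces to the ordinary Poisson pmf with mean gamma.
print(com_poisson_pmf(3, 2.0, 1.0))
print(math.exp(-2.0) * 2.0 ** 3 / math.factorial(3))
```

The fixed truncation is adequate when the series terms decay quickly; for a small shape parameter combined with a large γ, more terms (or an asymptotic approximation of Q) would be needed.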

Methodology: the penalized COM-Poisson GLM

This section introduces the COM-Poisson GLM reported by Sellers and Shmueli (2010) and shows the steps followed in this work to develop its penalized version. The GLM using $\eta(E[Y]) = \log(\gamma)$ is denoted as follows:
$$\log E[Y_i] = \log(\gamma) = X_i'\beta = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip},$$
where $X_{ij} \in \mathbb{R}^n$ is the ith observation of the jth model predictor. The values $\beta_j$ are the coefficients of the model predictors in the GLM. There are two well-known techniques for estimating the coefficients of the GLM shown in Eq. (6). These are the weighted
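A ridge-penalized objective for this GLM can be sketched as the negative COM-Poisson log-likelihood plus λ times the sum of squared coefficients. The Python sketch below is written under several assumptions that this preview does not confirm: the intercept is left unpenalized, the penalty is not rescaled (for example by 1/2), the shape parameter is passed in as a fixed value, and the normalizing series Q is truncated. All function and argument names are hypothetical.

```python
import math

def log_q(gamma, nu, max_terms=200):
    """Log of the truncated series Q(gamma, nu) = sum_s gamma^s / (s!)^nu."""
    log_terms = [s * math.log(gamma) - nu * math.lgamma(s + 1)
                 for s in range(max_terms)]
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def penalized_neg_loglik(beta, nu, X, y, lam):
    """Negative COM-Poisson log-likelihood plus lam * sum of squared slopes.

    beta[0] is the intercept and is not penalized (an assumption of this sketch).
    X is a list of predictor rows, y a list of counts, lam the ridge parameter.
    """
    loglik = 0.0
    for xi, yi in zip(X, y):
        # Log link: eta = log(gamma_i) = beta_0 + sum_j beta_j * x_ij
        eta = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
        loglik += yi * eta - nu * math.lgamma(yi + 1) - log_q(math.exp(eta), nu)
    return -loglik + lam * sum(b * b for b in beta[1:])

# Hypothetical toy data: one predictor, three observations.
X = [[0.0], [1.0], [2.0]]
y = [1, 2, 4]
print(penalized_neg_loglik([0.1, 0.5], 1.0, X, y, lam=2.0))
```

Minimizing this function over the coefficients (and, in a full implementation, the shape parameter) with a general-purpose optimizer would give penalized estimates; with λ = 0 it reduces to ordinary maximum likelihood.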

Penalized likelihood estimation based on the cross-validation

The L2 penalization may introduce some bias. A proper choice of the penalization parameter (λ) limits the negative effect of this bias and results in a lower mean squared error. Hoerl and Kennard (1970) first proposed this parameter when they introduced ridge regression as a solution to the collinearity problem. Since then, several methods have been developed to find the most efficient estimate of this parameter (Kibria, 2003, Khalaf and Shukur, 2005, Alkhamisi et al., 2006, Alkhamisi
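The general idea of choosing λ by K-fold cross-validation can be sketched as follows: split the data into K folds, fit on K − 1 folds for each candidate λ, score on the held-out fold, and keep the λ with the smallest average error. To stay short, the sketch below swaps in a one-predictor linear ridge estimate with a closed form in place of the penalized COM-Poisson fit; the function names, the candidate grid, and the squared-error score are all assumptions of this illustration, not the paper's procedure.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def ridge_slope(x, y, lam):
    """Closed-form ridge estimate for a one-predictor, no-intercept linear model."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def cv_error(x, y, lam, k=4, seed=0):
    """Mean squared prediction error over the k held-out folds."""
    sq_err = 0.0
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        b = ridge_slope([x[i] for i in train], [y[i] for i in train], lam)
        sq_err += sum((y[i] - b * x[i]) ** 2 for i in held_out)
    return sq_err / len(x)

def select_lambda(x, y, grid, k=4):
    """Return the candidate penalty with the smallest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(x, y, lam, k))

# Hypothetical noise-free data, so shrinkage can only hurt and lam = 0 wins.
x = [float(i) for i in range(1, 13)]
y = [2.0 * v for v in x]
print(select_lambda(x, y, grid=[0.0, 1.0, 10.0]))
```

For the penalized COM-Poisson GLM, `ridge_slope` would be replaced by a routine that maximizes the penalized likelihood on the training folds, and the squared error by a prediction criterion suited to counts.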

An illustrative example: the MVC-F in Toronto, Ontario

In this section, we compare the prediction accuracy of the penalized COM-Poisson GLM with both the Poisson and the classical COM-Poisson GLM. For a fair comparison, this paper uses the overdispersed crash dataset collected in 1995 at 868 signalized intersections located in Toronto, Ontario. Several researchers have used this dataset (Lord, 2000, Miaou and Lord, 2003, Miranda-Moreno and Fu, 2007, Lord et al., 2008, Sellers and Shmueli, 2010). For a further detailed description of this dataset,

Optimizing the K-parameter

The selection of K is critical to the quality of the cross-validation. In Section 5, we used K = 4. The results showed the advantage of the penalized COM-Poisson over its counterparts, the classical COM-Poisson and the Poisson regressions. In this section, we conduct an analytical study to investigate the performance of the penalized COM-Poisson under several values of K.

Three different values, K = {2, 6, 8}, were considered. Table 7 shows the descriptive analysis of the sub-dataset under each

Conclusion

This paper integrates penalized likelihood estimation, more specifically the ridge penalty function, with the COM-Poisson GLM regression to enhance its prediction accuracy in the presence of collinearity. Using penalized likelihood estimation makes the COM-Poisson GLM regression more robust to inflation in the standard errors of the GLM estimates. The real-world example illustrated the excellent performance of the penalized COM-Poisson GLM regression in simultaneously

Acknowledgment

The authors are grateful to Dominique Lord (Texas A&M University) and Srinivas Geedipally (Texas A&M Transportation Institute) for providing the dataset used in the real-world example of Toronto, Ontario.

References (32)

  • R.W. Conway et al. A queuing model with state-dependent service rates. J. Ind. Eng. (1962)
  • G. Frank et al. Determinants of seat belt use: a regression analysis with FARS data corrected for self-selection. J. Saf. Res. (2015)
  • Nazha R. Ghadban et al. Analyzing the impact of human characteristics on the comprehensibility of road traffic signs. The Proceedings of the International Conference on Industrial Engineering and Operations Management, Bandung, Indonesia, March 6–8, 2018 (2018)
  • C.S. Göbl et al. Application of penalized regression techniques in modelling insulin sensitivity by correlated metabolic parameters. PLoS ONE (2015)
  • J.J. Goeman. L1 penalized estimation in the Cox proportional hazards model. Biometr. J. (2010)
  • S.D. Guikema et al. A flexible count data regression model for risk analysis. Risk Anal. (2007)