Generalized Extreme Value models for count data: Application to worker telecommuting frequency choices

https://doi.org/10.1016/j.trb.2015.11.008

Highlights

  • Developed GEV models that subsume standard count models as special cases.

  • Examined the ability to retrieve the GEV model parameters using simulations.

  • Demonstrated the applicability of the GEV models for analyzing telecommuting choices.

Abstract

Count models are used for analyzing outcomes that can only take non-negative integer values with or without any pre-specified large upper limit. However, count models are typically considered to be different from random utility models such as the multinomial logit (MNL) model. In this paper, Generalized Extreme Value (GEV) models that are consistent with the Random Utility Maximization (RUM) framework and that subsume standard count models including Poisson, Geometric, Negative Binomial, Binomial, and Logarithmic models as special cases were developed. The ability of the Maximum Likelihood (ML) inference approach to retrieve the parameters of the resulting GEV count models was examined using synthetic data. The simulation results indicate that the ML estimation technique performs quite well in terms of recovering the true parameters of the proposed GEV count models. Also, the models developed were used to analyze the monthly telecommuting frequency decisions of workers. Overall, the empirical results demonstrate superior data fit and better predictive performance of the GEV models compared to standard count models.

Introduction

Count data are ubiquitous in empirical research. In the travel behavior context, count models have been used to analyze several key travel-related choices including household auto ownership (Zhao and Kockelman, 2002), telecommuting frequency (Singh et al., 2013), activity episode frequency choices (Bhat et al., 2015), and non-motorized mode usage counts (Smith and Kauermann, 2011). In transportation safety, count models are used to develop Safety Performance Functions (SPFs) that quantify the frequency of crash occurrences at any given location or region (Qin et al., 2005, Ahmed et al., 2011, Narayanamoorthy et al., 2013). In transportation geography, count models are used to examine the growth and decline of business establishments in a region over time (Manjón Antolín and Arauzo Carod, 2011, Bhat et al., 2014). Most of these studies used the Poisson or the Negative Binomial (NB) models (or their variants) for analyzing count data. The Poisson model assumes that the expected value and the variance of count data are equal. However, this is a very restrictive assumption, and several past studies found evidence of the variance exceeding the expected value in certain empirical contexts. This property is referred to as ‘over-dispersion’, and the NB model is particularly suited for such scenarios. The added flexibility arises because the NB model is a generalization of the Poisson model. Specifically, the NB model is a mixture of Poisson models in which the expected value of the Poisson model is gamma distributed (Greene, 2008). Although rare, binomial and logarithmic distributions have been used for analyzing count data with ‘under-dispersion’, i.e., when the expected value is higher than the variance in the data (Winkelmann, 2013).
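The mean-variance contrast between the Poisson and NB models can be illustrated numerically. The following is a minimal sketch, not taken from the paper, that computes the mean and variance of both distributions by direct summation of their probability mass functions. The names `mu` (expected value) and `r` (gamma shape), the chosen values, and the summation limit `max_count` are illustrative assumptions.

```python
import math

def poisson_pmf(k, mu):
    # P(k) = exp(-mu) * mu^k / k!, evaluated in logs for numerical stability
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def nb_pmf(k, mu, r):
    # NB pmf that results from mixing Poisson models over a gamma-distributed
    # expected value with mean mu and shape r (r is an illustrative name)
    log_p = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

def mean_and_variance(pmf, max_count=500):
    # Sum over counts 0..max_count; the omitted tail is negligible here
    probs = [pmf(k) for k in range(max_count + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

mu, r = 3.0, 2.0
pois_mean, pois_var = mean_and_variance(lambda k: poisson_pmf(k, mu))
nb_mean, nb_var = mean_and_variance(lambda k: nb_pmf(k, mu, r))
print("Poisson mean, variance:", pois_mean, pois_var)
print("NB mean, variance:", nb_mean, nb_var)
```

Under these assumed values, both distributions share the same expected value, while the NB variance equals mu + mu^2 / r and therefore exceeds it.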

In some cases, deviations from the probability mass assigned by the assumed discrete distribution (e.g., Poisson or NB) for specific count outcomes can be the primary reason for under- or over-dispersion in the data. Standard count models have been modified in the past to deal with such situations. For instance, zero-inflated (ZI) and hurdle count models are generalizations of standard count models for handling over-representation of zeroes in count data (Gurmu, 1998, Lord et al., 2005). Specifically, the ZI and hurdle count models are obtained by assuming a sequential decision-making process. In the context of telecommuting, the ZI model assumes that the decision maker first chooses whether to exercise her/his option to telecommute, and then in the second stage chooses the number of days to telecommute (including the choice to not telecommute). In the hurdle model, on the other hand, the decision maker first chooses whether to telecommute or not and then, in the second stage, chooses the non-zero number of days to telecommute. These two-stage models also tend to fit the data better because they provide additional flexibility to account for over-representation of zeroes in the dataset. However, similar behavioral interpretations for over-representation of non-zero count outcomes are not convincing (e.g., three-stage sequential decision making for over-representation of both zeroes and ones in the data, or multi-stage decision making for over-representation of multiple count outcomes). Moreover, even mathematically, estimating inflated and hurdle models with multiple stages of decision making is not easy because the resulting model structure is not parsimonious.
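The difference between the two sequential structures can be made concrete with a small sketch. It assumes a Poisson second stage for both models, which is one common choice rather than anything specified in the text; the function names and the parameters `lam`, `p_never`, and `p_zero` are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zero_inflated_pmf(k, lam, p_never):
    # Stage 1: with probability p_never the option is not exercised (count is 0).
    # Stage 2: otherwise the count is Poisson, and it may still be 0.
    base = (1.0 - p_never) * poisson_pmf(k, lam)
    return p_never + base if k == 0 else base

def hurdle_pmf(k, lam, p_zero):
    # Stage 1: the count is 0 with probability p_zero.
    # Stage 2: otherwise the count is Poisson truncated to 1, 2, 3, ...
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

lam = 2.0
zi_probs = [zero_inflated_pmf(k, lam, p_never=0.3) for k in range(60)]
hurdle_probs = [hurdle_pmf(k, lam, p_zero=0.3) for k in range(60)]
print("ZI P(0):", zi_probs[0], "hurdle P(0):", hurdle_probs[0])
```

In the ZI structure a zero can arise in either stage, so P(0) is larger than `p_never`; in the hurdle structure all zeroes come from the first stage, so P(0) equals `p_zero` exactly.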

Recently, Castro et al. (2012) proposed a Generalized Ordered Response (GOR) framework that subsumes standard count models as special cases to handle scenarios in which several count outcomes can have deviations from the probability mass implied by the underlying count probability distribution. The GOR framework is also well suited for analyzing multivariate count data because correlations among multiple count variables can be captured easily through the propensity equation, instead of through common mixing terms in the expected value specification of correlated count data, which is the standard practice (Narayanamoorthy et al., 2013). However, the GOR framework assumes a behavioral mechanism different from the Random Utility Maximization (RUM) principle (Bhat and Pulugurta, 1998). Specifically, while RUM models such as the multinomial logit (MNL) assume that latent utilities associated with different choice alternatives are translated into observed choice based on the utility maximization rule, the GOR models assume that a single latent propensity is translated into observed outcomes based on its value relative to threshold parameters. This is not to suggest that the GOR models are inconsistent with the RUM principle. In fact, Small (1984) showed that the GOR models can be recast as special cases of the MNL model with a specific form of non-linear utility specification. However, even ordered response models recast as MNL models assume a restrictive correlation structure for the underlying utilities. In cases when observed count data is an outcome of repeated discrete choices in a panel setting, researchers in the past linked count models to random utility models by using the maximum utility from the discrete choice model as an explanatory variable in the expected value specification of the count model (Burda et al., 2009, Bhat et al., 2015). For instance, worker daily out-of-home non-mandatory activity frequency can be viewed as an outcome of repeated activity-type choice decisions in a day. So, the maximum utility from the lower-level discrete choice model for activity-type choice can be used as an explanatory variable in the expected value specification of the total activity frequency model. While these models capture the linkages between the count model and the underlying discrete choices better, they retain the parametric discrete probability distribution assumption for the count model component.

Other attempts to link count models to the utility framework have assumed a hierarchical decision-making structure with an indefinite number of choice occasions (Daly, 1997, Daly and Miller, 2006, Ortuzar and Willumsen, 2011). For instance, in the context of telecommuting, each worker is assumed to decide at each hierarchical level whether to telecommute more days or stop: Level 1: 0 or 1+ days; Level 2: 1 or 2+ days; Level 3: 2 or 3+ days; and so on. This model is referred to as the ‘frequency choice logit’ model and was shown to collapse to the geometric count model when the choice model for the first level (i.e., 0 or 1+ days) is assumed to be the same as the choice model for all subsequent levels. Daly and Miller (2006) conclude their study by noting that such links to utility theory for other count models such as the Poisson model are yet to be established.
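The collapse to the geometric model can be checked with a short sketch. It assumes that each level is a binary logit between stopping and continuing, with the utility of stopping normalized to zero; the function name and the utility value `v` are hypothetical and not taken from the cited studies.

```python
import math

def frequency_choice_probs(continue_utils):
    """Return P(count = k) for k = 0, 1, ..., len(continue_utils) - 1.

    continue_utils[k] is the (assumed) utility of continuing past level k,
    with the utility of stopping at that level normalized to zero.
    """
    probs = []
    reach = 1.0  # probability of reaching the current level
    for v in continue_utils:
        p_continue = 1.0 / (1.0 + math.exp(-v))
        probs.append(reach * (1.0 - p_continue))  # stop here
        reach *= p_continue                       # or move to the next level
    return probs

v = -0.4      # same binary choice model at every level
levels = 8
same_model = frequency_choice_probs([v] * levels)

# Geometric pmf with stopping probability p_stop: P(k) = p_stop * (1 - p_stop)^k
p_stop = 1.0 - 1.0 / (1.0 + math.exp(-v))
geometric = [p_stop * (1.0 - p_stop) ** k for k in range(levels)]

print(max(abs(a - b) for a, b in zip(same_model, geometric)))
```

Passing level-specific utilities instead of a constant list yields count probabilities that no longer follow the geometric form, which is the added flexibility of the hierarchical structure.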

The objectives of the current study are three-fold: (1) develop Generalized Extreme Value (GEV) models that are consistent with the RUM framework (McFadden, 1978) and subsume standard count models as special cases, including the frequency choice logit model developed by Daly and colleagues; (2) examine the ability of the Maximum Likelihood (ML) inference approach to retrieve the parameters of the proposed GEV models without bias using synthetic data; and (3) demonstrate the applicability of the GEV count models developed in this study in an empirical context of considerable importance to the transportation choice modeling community, namely worker telecommuting frequency decisions. The recasting of count models as GEV models in the first objective not only provides additional flexibility for modeling count data but also offers considerable behavioral advantages. For instance, log-sum measures from count models can be computed and used as measures of consumer surplus (Kohli and Daly, 2006, de Jong et al., 2007). Also, the log-sum from a count model can be used as an explanatory variable in sequential estimation of multi-dimensional models of inter-dependent choices (Bowman and Ben-Akiva, 2001, Yao and Morikawa, 2005).

Section snippets

Methodological framework

There are primarily five different distributions that are used in the literature for modeling count data, namely Poisson, Geometric, Negative Binomial (NB), Binomial, and Logarithmic. This section demonstrates how each of these five models can be recast as special cases of the simplest GEV model, i.e., the multinomial logit (MNL) model. Before we proceed, please note that in the MNL model, the probability that alternative i with observed utility Vi is chosen from a set of J mutually exhaustive
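As an illustration of the idea, the standard MNL probability exp(V_i) / sum_j exp(V_j) reproduces the Poisson distribution when the utility of count k is set to k*ln(lam) - ln(k!). The sketch below checks this numerically; it is one specification that yields the Poisson case and may differ in notation from the paper's full derivation, which is not shown in this snippet. The names `lam` and `max_count` and the chosen values are assumptions.

```python
import math

def mnl_probs(utils):
    # Standard MNL: P_i = exp(V_i) / sum_j exp(V_j), shifted by max(V) for stability
    m = max(utils)
    expv = [math.exp(v - m) for v in utils]
    total = sum(expv)
    return [e / total for e in expv]

def poisson_utils(lam, max_count):
    # Assumed utility of count k: V_k = k * ln(lam) - ln(k!)
    return [k * math.log(lam) - math.lgamma(k + 1) for k in range(max_count + 1)]

lam = 2.5
max_count = 60  # truncation point for the (in principle infinite) choice set
utils = poisson_utils(lam, max_count)

mnl = mnl_probs(utils)
poisson = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(max_count + 1)]
max_gap = max(abs(a - b) for a, b in zip(mnl, poisson))

# Log-sum of the MNL: ln(sum_k exp(V_k)), which approaches lam as max_count grows
logsum = math.log(sum(math.exp(v) for v in utils))
print(max_gap, logsum)
```

With this utility specification the denominator of the MNL sums to exp(lam), so the log-sum of the count model equals the Poisson expected value. Adding alternative-specific terms to selected V_k would shift probability mass toward or away from particular counts.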

Simulation analysis

The GEV versions of count models proposed in this study deal with infinite choice sets of ordinal outcomes. However, during model estimation, the choice set must be truncated at a pre-determined maximum count value. So, this is equivalent to misspecification of the choice set. The maximum likelihood (ML) inference approach may not produce consistent and unbiased estimates under misspecification (White, 1982). So, the bias and consistency of the OGEV count model parameter estimates obtained
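To give a sense of how much probability mass a truncated choice set leaves out, the sketch below computes the mass above several truncation points. It uses a Poisson distribution with an assumed expected value of 3 purely for illustration; it does not reproduce the paper's simulation design, and `tail_mass` and `extra_terms` are hypothetical names.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tail_mass(lam, max_count, extra_terms=200):
    # Probability of counts above max_count, summed directly rather than
    # computed as 1 - CDF to avoid floating-point cancellation
    start = max_count + 1
    return sum(poisson_pmf(k, lam) for k in range(start, start + extra_terms))

lam = 3.0
tails = {K: tail_mass(lam, K) for K in (5, 10, 20)}
for K, mass in tails.items():
    print("max count", K, "omitted mass", mass)
```

The omitted mass shrinks quickly as the truncation point moves further above the expected value, which suggests why a sufficiently large maximum count may keep the effect of truncation small.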

Empirical application

With growing congestion levels and increasing budgetary constraints for new infrastructure projects, travel demand management (TDM) strategies that lead to efficient use of available network capacity are becoming the norm for improving travel conditions. Among TDM strategies, those that focus on commuters, such as telecommuting, staggered and flexible work hours, and parking subsidies, are considered more effective because they directly impact peak period traffic. So, it would be useful to

Conclusions

Traditional count data models such as the Poisson and Negative Binomial (NB) models that assume a discrete probability distribution for count outcomes are commonly used in the literature. While generalizations that can handle over or under-representation of specific count outcomes (e.g., zero inflated or hurdle count models) were developed in the past, extending these methods for cases with over or under-representation of multiple count outcomes can result in complex model structures that may

Acknowledgments

The author would like to thank two anonymous reviewers whose comments helped improve an earlier version of the paper considerably.

References (34)

  • Lord, D. et al. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accident Analysis & Prevention (2005)
  • Narayanamoorthy, S. et al. On accommodating spatial dependence in bicycle and pedestrian injury counts by severity level. Transportation Research Part B (2013)
  • Smith, M.S. et al. Bicycle commuting in Melbourne during the 2000s energy crisis: A semiparametric analysis of intraday volumes. Transportation Research Part B (2011)
  • Yao, E. et al. A study of on integrated intercity travel demand model. Transportation Research Part A (2005)
  • Asgari, H. et al. Choice, frequency, and engagement: framework for telecommuting behavior analysis and modeling. Transportation Research Record: Journal of the Transportation Research Board (2014)
  • Bernardino, A. et al. Modeling the process of adoption of telecommuting: comprehensive framework. Transportation Research Record: Journal of the Transportation Research Board (1996)
  • Bhat, C.R. et al. A new utility-consistent econometric approach to multivariate count data modeling. Journal of Applied Econometrics (2015)