Generalized Extreme Value models for count data: Application to worker telecommuting frequency choices
Introduction
Count data are ubiquitous in empirical research. In the travel behavior context, count models have been used to analyze several key travel-related choices including household auto ownership (Zhao and Kockelman, 2002), telecommuting frequency (Singh et al., 2013), activity episode frequency choices (Bhat et al., 2015), and non-motorized mode usage counts (Smith and Kauermann, 2011). In transportation safety, count models are used to develop Safety Performance Functions (SPFs) that quantify the frequency of crash occurrences at any given location or region (Qin et al., 2005, Ahmed et al., 2011, Narayanamoorthy et al., 2013). In transportation geography, count models are used to examine the growth and decline of business establishments in a region over time (Manjón Antolín and Arauzo Carod, 2011, Bhat et al., 2014). Most of these studies used the Poisson or the Negative Binomial (NB) models (or their variants) for analyzing count data. The Poisson model assumes that the expected value and the variance of the count data are equal. This is a very restrictive assumption, and several studies have found the variance to exceed the expected value in certain empirical contexts. This property is referred to as ‘over-dispersion’ and the NB model is particularly suited for such scenarios. The added flexibility arises because the NB model generalizes the Poisson model. Specifically, the NB model is a mixture of Poisson models in which the expected value of the Poisson model is gamma distributed (Greene, 2008). Although rarely, binomial and logarithmic distributions have also been used for analyzing count data with ‘under-dispersion’, i.e., when the expected value is higher than the variance in the data (Winkelmann, 2013).
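The equal mean and variance of the Poisson model, and the over-dispersion of its gamma mixture, can be checked numerically. The sketch below is illustrative only: it assumes the common NB2 parameterization (mean mu, dispersion alpha, variance mu + alpha * mu^2), and the function names and parameter values are hypothetical rather than taken from the study.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of count k with mean mu."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def nb_pmf(k, mu, alpha):
    """Negative Binomial probability of count k (Poisson-gamma mixture).

    Assumed NB2 form: mean mu, dispersion alpha > 0, variance mu + alpha * mu**2.
    """
    r = 1.0 / alpha  # shape of the gamma mixing distribution
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

def mean_and_variance(pmf, max_count=500):
    """Sum a pmf over 0..max_count to approximate its mean and variance."""
    probs = [pmf(k) for k in range(max_count + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

mu, alpha = 2.0, 0.5  # illustrative values only
pois_mean, pois_var = mean_and_variance(lambda k: poisson_pmf(k, mu))
nb_mean, nb_var = mean_and_variance(lambda k: nb_pmf(k, mu, alpha))

print(f"Poisson: mean={pois_mean:.4f}, variance={pois_var:.4f}")
print(f"NB:      mean={nb_mean:.4f}, variance={nb_var:.4f}")
```

Under these assumptions both models share the same mean, while the NB variance exceeds it by alpha * mu^2.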
In some cases, deviations from the probability mass assigned by the assumed discrete distribution (e.g., Poisson or NB) for specific count outcomes can be the primary reason for under- or over-dispersion in the data. The standard count models were modified in the past to deal with such situations. For instance, zero-inflated (ZI) and hurdle count models are generalizations of standard count models for handling over-representation of zeroes in count data (Gurmu, 1998, Lord et al., 2005). Specifically, the ZI and hurdle count models are obtained by assuming a sequential decision-making process. In the context of telecommuting, the ZI model assumes that the decision maker first chooses whether to exercise her/his option to telecommute, and then in the second stage chooses the number of days to telecommute (including the choice to not telecommute). In the hurdle model, on the other hand, the decision maker first chooses whether to telecommute or not and then, in the second stage, chooses the number of non-zero days to telecommute. These two-stage models also tend to fit the data better because they provide additional flexibility to account for over-representation of zeroes in the dataset. However, similar behavioral interpretations for over-representation of non-zero count outcomes are not convincing (e.g., three-stage sequential decision making for over-representation of both zeroes and ones in the data, or multi-stage decision making for over-representation of multiple count outcomes). Moreover, even mathematically, estimating inflated and hurdle models with multiple stages of decision making is not easy because the resulting model structure is not parsimonious.
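The difference between the two sequential structures can be made concrete with a small sketch. It assumes a Poisson second stage and constant first-stage probabilities (no covariates). The names `pi_never` and `pi_zero` and all numeric values are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of count k with mean lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def zip_pmf(k, lam, pi_never):
    """Zero-inflated Poisson.

    Stage 1: with probability pi_never the option to telecommute is not exercised.
    Stage 2: otherwise the count is Poisson, and may still be zero.
    """
    base = (1.0 - pi_never) * poisson_pmf(k, lam)
    return pi_never + base if k == 0 else base

def hurdle_pmf(k, lam, pi_zero):
    """Hurdle Poisson.

    Stage 1: zero days with probability pi_zero.
    Stage 2: otherwise the count is drawn from a zero-truncated Poisson (k >= 1).
    """
    if k == 0:
        return pi_zero
    return (1.0 - pi_zero) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

lam, first_stage_prob = 1.2, 0.4  # illustrative values only
zip_total = sum(zip_pmf(k, lam, first_stage_prob) for k in range(101))
hurdle_total = sum(hurdle_pmf(k, lam, first_stage_prob) for k in range(101))

print("P(0): Poisson", round(poisson_pmf(0, lam), 4),
      "| ZI", round(zip_pmf(0, lam, first_stage_prob), 4),
      "| hurdle", round(hurdle_pmf(0, lam, first_stage_prob), 4))
```

In the ZI form, zeroes come from two sources (never exercising the option, or choosing zero days in the second stage), whereas in the hurdle form all zeroes come from the first stage.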
Recently, Castro et al. (2012) proposed a Generalized Ordered Response (GOR) framework that subsumes standard count models as special cases and handles scenarios in which several count outcomes deviate from the probability mass implied by the underlying count probability distribution. The GOR framework is also well suited for analyzing multivariate count data because correlations among multiple count variables can be captured easily through the propensity equation, instead of through common mixing terms in the expected value specification of correlated count data, which is the norm (Narayanamoorthy et al., 2013). However, the GOR framework assumes a behavioral mechanism different from the Random Utility Maximization (RUM) principle (Bhat and Pulugurta, 1998). Specifically, while RUM models such as the multinomial logit (MNL) assume that latent utilities associated with different choice alternatives are translated into the observed choice based on the utility maximization rule, the GOR models assume that a single latent propensity is translated into observed outcomes based on its value relative to threshold parameters. This is not to suggest that the GOR models are inconsistent with the RUM principle. In fact, Small (1984) showed that the GOR models can be recast as special cases of the MNL model with a specific form of non-linear utility specification. However, even GOR models recast as MNL models assume a restrictive correlation structure for the underlying utilities. In cases where observed count data are an outcome of repeated discrete choices in a panel setting, researchers in the past linked count models to random utility models by using the maximum utility from the discrete choice model as an explanatory variable in the expected value specification of the count model (Burda et al., 2009, Bhat et al., 2015).
For instance, a worker's daily out-of-home non-mandatory activity frequency can be viewed as an outcome of repeated activity-type choice decisions in a day. So, the maximum utility from the lower-level discrete choice model for activity-type choice can be used as an explanatory variable in the expected value specification of the total activity frequency model. While these models better capture the linkages between the count model and the underlying discrete choices, they retain the parametric discrete probability distribution assumption for the count model component.
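The threshold mechanism of the GOR framework discussed above can be sketched as follows. This is a simplified illustration, not the specification of Castro et al. (2012). It assumes a standard normal latent propensity with no covariates and thresholds set at the normal quantiles of the Poisson cumulative probabilities, plus optional outcome-specific shifts. The function names and the `shifts` argument are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed distribution of the latent propensity

def poisson_pmf(k, lam):
    """Poisson probability of count k with mean lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_cdf(k, lam):
    """Poisson cumulative probability P(y <= k)."""
    return sum(poisson_pmf(r, lam) for r in range(k + 1))

def gor_count_probs(lam, max_count, shifts=None):
    """Return P(y = k) for k = 0..max_count under the threshold mechanism.

    Threshold k is the normal quantile of the Poisson CDF at k plus an
    optional shift; count k is observed when the propensity falls between
    thresholds k-1 and k. Mass above max_count is not returned.
    """
    shifts = shifts or {}
    probs, prev_cum = [], 0.0
    for k in range(max_count + 1):
        threshold = STD_NORMAL.inv_cdf(poisson_cdf(k, lam)) + shifts.get(k, 0.0)
        cum = STD_NORMAL.cdf(threshold)
        probs.append(cum - prev_cum)
        prev_cum = cum
    return probs

lam = 1.5  # illustrative value only
base = gor_count_probs(lam, max_count=6)                       # no shifts
inflated = gor_count_probs(lam, max_count=6, shifts={0: 0.3})  # shift first threshold

print("P(0), P(1) without shifts:", round(base[0], 4), round(base[1], 4))
print("P(0), P(1) with shift:    ", round(inflated[0], 4), round(inflated[1], 4))
```

With all shifts at zero the sketch reproduces the Poisson probabilities. Shifting a single threshold moves probability mass between the two adjacent count outcomes and leaves the others unchanged.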
Other attempts to link count models to the utility framework have assumed a hierarchical decision-making structure with an indefinite number of choice occasions (Daly, 1997, Daly and Miller, 2006, Ortuzar and Willumsen, 2011). For instance, in the context of telecommuting, each worker is assumed to decide at each hierarchical level whether to telecommute more days or stop (Level 1: 0 or 1+ days; Level 2: 1 or 2+ days; Level 3: 2 or 3+ days; and so on). This model is referred to as the ‘frequency choice logit’ model and was shown to collapse to the geometric count model when the choice model for the first level (i.e., 0 or 1+ days) is assumed to be the same as the choice model for all subsequent levels. Daly and Miller (2006) conclude their study by noting that such links to utility theory for other count models, such as the Poisson model, are yet to be established.
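The collapse of the hierarchical structure to the geometric model can be verified with a short sketch. It assumes a binary logit at every level with the utility of stopping normalized to zero. The names `v_first` and `v_later` and the utility values are hypothetical.

```python
import math

def logistic(v):
    """Binary logit probability of continuing when stopping has utility 0."""
    return 1.0 / (1.0 + math.exp(-v))

def frequency_choice_probs(v_first, v_later, max_count):
    """Return P(y = k) for k = 0..max_count under sequential stop/continue choices.

    v_first: utility of continuing at Level 1 (0 versus 1+ days).
    v_later: utility of continuing at every subsequent level.
    """
    p_first, p_later = logistic(v_first), logistic(v_later)
    probs = [1.0 - p_first]  # stop immediately: zero days
    for k in range(1, max_count + 1):
        # continue at Level 1, continue k-1 more times, then stop
        probs.append(p_first * p_later ** (k - 1) * (1.0 - p_later))
    return probs

def geometric_pmf(k, p_continue):
    """Geometric probability of exactly k continuations before stopping."""
    return (1.0 - p_continue) * p_continue ** k

v = -0.4  # illustrative utility of continuing
same_levels = frequency_choice_probs(v, v, max_count=10)
geometric = [geometric_pmf(k, logistic(v)) for k in range(11)]
different_first = frequency_choice_probs(-1.5, v, max_count=10)

print("Largest gap to geometric:",
      max(abs(a - b) for a, b in zip(same_levels, geometric)))
```

When `v_first` differs from `v_later`, as in `different_first`, the probability of zero days is set separately from the rest of the distribution, which is the extra flexibility of the frequency choice logit over the geometric model.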
The objectives of the current study are three-fold: (1) develop Generalized Extreme Value (GEV) models that are consistent with the RUM framework (McFadden, 1978) and subsume standard count models as special cases, including the frequency choice logit model developed by Daly and colleagues (Daly, 1997, Daly and Miller, 2006). This recasting of count models as GEV models not only provides additional flexibility for modeling count data but also offers considerable behavioral advantages. For instance, logsum measures from count models can be computed and used as measures of consumer surplus (Kohli and Daly, 2006, de Jong et al., 2007). Also, the logsum from a count model can be used as an explanatory variable in sequential estimation of multi-dimensional models of inter-dependent choices (Bowman and Ben-Akiva, 2001, Yao and Morikawa, 2005); (2) examine the ability of the Maximum Likelihood (ML) inference approach to retrieve the parameters of the proposed GEV models without bias using synthetic data; and (3) demonstrate the applicability of the GEV count models developed in this study in an empirical context of considerable importance to the transportation choice modeling community, namely worker telecommuting frequency decisions.
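The idea behind the first objective can be illustrated with the Poisson case. The sketch below assumes the utility specification V_k = k ln(lambda) - ln(k!) for count alternative k. This specification is an assumption chosen because the resulting MNL probabilities equal the Poisson probabilities. It is not quoted from the study. The function name and parameter values are hypothetical, and the choice set is truncated at `max_count`.

```python
import math

def mnl_count_probs(lam, max_count):
    """MNL probabilities and logsum over count alternatives 0..max_count.

    Assumed utility of alternative k: V_k = k * ln(lam) - ln(k!).
    """
    utilities = [k * math.log(lam) - math.lgamma(k + 1)
                 for k in range(max_count + 1)]
    v_max = max(utilities)                     # subtract the max for stability
    exp_v = [math.exp(v - v_max) for v in utilities]
    denom = sum(exp_v)
    logsum = v_max + math.log(denom)           # ln(sum_k exp(V_k))
    return [e / denom for e in exp_v], logsum

def poisson_pmf(k, lam):
    """Poisson probability of count k with mean lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

lam = 2.0  # illustrative value only
probs, logsum = mnl_count_probs(lam, max_count=60)
short_probs, short_logsum = mnl_count_probs(lam, max_count=3)

print("MNL P(3):", round(probs[3], 6), "| Poisson P(3):", round(poisson_pmf(3, lam), 6))
print("Logsum with 61 alternatives:", round(logsum, 6))
print("Logsum with 4 alternatives: ", round(short_logsum, 6))
```

Under this assumed specification the logsum over the full choice set equals lambda, because the sum of lambda^k / k! over all k is exp(lambda). Truncating the choice set at a small maximum count lowers the logsum and turns the probabilities into those of a truncated Poisson.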
Section snippets
Methodological framework
There are primarily five different distributions that are used in the literature for modeling count data, namely Poisson, Geometric, Negative Binomial (NB), Binomial, and Logarithmic. This section demonstrates how each of these five models can be recast as special cases of the simplest GEV model, i.e., the multinomial logit (MNL) model. Before we proceed, please note that in the MNL model, the probability that alternative i with observed utility Vi is chosen from a set of J mutually exhaustive
Simulation analysis
The GEV versions of count models proposed in this study deal with infinite choice sets of ordinal outcomes. However, during model estimation, the choice set must be truncated at a pre-determined maximum count value. So, this is equivalent to misspecification of the choice set. The maximum likelihood (ML) inference approach may not produce consistent and unbiased estimates under misspecification (White, 1982). So, the bias and consistency of the OGEV count model parameter estimates obtained
Empirical application
With growing congestion levels and increasing budgetary constraints for new infrastructure projects, travel demand management (TDM) strategies that lead to efficient use of available network capacity are becoming the norm for improving travel conditions. Among TDM strategies, those that focus on commuters, such as telecommuting, staggered and flexible work hours, and parking subsidies, are considered more effective because they directly impact peak period traffic. So, it would be useful to
Conclusions
Traditional count data models such as the Poisson and Negative Binomial (NB) models that assume a discrete probability distribution for count outcomes are commonly used in the literature. While generalizations that can handle over or under-representation of specific count outcomes (e.g., zero inflated or hurdle count models) were developed in the past, extending these methods for cases with over or under-representation of multiple count outcomes can result in complex model structures that may
Acknowledgments
The author would like to thank two anonymous reviewers whose comments helped improve an earlier version of the paper considerably.
References (34)
- et al. Exploring a Bayesian hierarchical approach for developing safety performance functions for a mountainous freeway. Accident Analysis & Prevention (2011)
- et al. A comparison of two alternative behavioral choice mechanisms for household auto ownership decisions. Transportation Research Part B (1998)
- et al. Activity-based disaggregate travel demand model system with activity schedules. Transportation Research Part A (2001)
- et al. A latent variable representation of count data models to accommodate spatial and temporal dependence: Application to predicting crash frequency at intersections. Transportation Research Part B (2012)
- et al. The logsum as an evaluation measure: Review of the literature and new results. Transportation Research Part A (2007)
- et al. Discrete choice models with multiplicative error terms. Transportation Research Part B (2009)
- Functional forms for the negative binomial model for count data. Economics Letters (2008)
- Generalized hurdle count data regression models. Economics Letters (1998)
- et al. Misclassification of the dependent variable in a discrete-response setting. Journal of Econometrics (1998)
- et al. A utility-consistent, combined discrete choice and count data model assessing recreational use losses due to natural resource damage. Journal of Public Economics (1995)
- Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accident Analysis & Prevention
- On accommodating spatial dependence in bicycle and pedestrian injury counts by severity level. Transportation Research Part B
- Bicycle commuting in Melbourne during the 2000s energy crisis: A semiparametric analysis of intraday volumes. Transportation Research Part B
- A study of on integrated intercity travel demand model. Transportation Research Part A
- Choice, frequency, and engagement: framework for telecommuting behavior analysis and modeling. Transportation Research Record: Journal of the Transportation Research Board
- Modeling the process of adoption of telecommuting: comprehensive framework. Transportation Research Record: Journal of the Transportation Research Board
- A new utility-consistent econometric approach to multivariate count data modeling. Journal of Applied Econometrics