Generalized Extreme Value models for count data: Application to worker telecommuting frequency choices

doi:10.1016/j.trb.2015.11.008

Transportation Research Part B: Methodological

Volume 83, January 2016, Pages 104-120

https://doi.org/10.1016/j.trb.2015.11.008

Highlights

• Developed GEV models that subsume standard count models as special cases.
• Examined the ability to retrieve the GEV model parameters using simulations.
• Demonstrated the applicability of the GEV models for analyzing telecommuting choices.

Abstract

Count models are used for analyzing outcomes that can only take non-negative integer values with or without any pre-specified large upper limit. However, count models are typically considered to be different from random utility models such as the multinomial logit (MNL) model. In this paper, Generalized Extreme Value (GEV) models that are consistent with the Random Utility Maximization (RUM) framework and that subsume standard count models including Poisson, Geometric, Negative Binomial, Binomial, and Logarithmic models as special cases were developed. The ability of the Maximum Likelihood (ML) inference approach to retrieve the parameters of the resulting GEV count models was examined using synthetic data. The simulation results indicate that the ML estimation technique performs quite well in terms of recovering the true parameters of the proposed GEV count models. Also, the models developed were used to analyze the monthly telecommuting frequency decisions of workers. Overall, the empirical results demonstrate superior data fit and better predictive performance of the GEV models compared to standard count models.

Introduction

Count data are ubiquitous in empirical research. In the travel behavior context, count models have been used to analyze several key travel-related choices including household auto ownership (Zhao and Kockelman, 2002), telecommuting frequency (Singh et al., 2013), activity episode frequency choices (Bhat et al., 2015), and non-motorized mode usage counts (Smith and Kauermann, 2011). In transportation safety, count models are used to develop Safety Performance Functions (SPFs) that quantify the frequency of crash occurrences at any given location or region (Qin et al., 2005, Ahmed et al., 2011, Narayanamoorthy et al., 2013). In transportation geography, count models are used to examine the growth and decline of business establishments in a region over time (Manjón Antolín and Arauzo Carod, 2011, Bhat et al., 2014). Most of these studies used the Poisson or the Negative Binomial (NB) models (or their variants) for analyzing count data. The Poisson model assumes that the expected value and the variance of count data are equal. However, this is a very restrictive assumption, and several past studies found evidence of the variance exceeding the expected value in certain empirical contexts. This property is referred to as ‘over-dispersion’ and the NB model is particularly suited for such scenarios. This added flexibility arises because the NB model is a generalized version of the Poisson model. To be specific, the NB model is a mixture of Poisson models in which the expected value of the Poisson model is gamma distributed (Greene, 2008). Although rarely, binomial and logarithmic distributions have also been used for analyzing count data with ‘under-dispersion’, i.e., when the expected value is higher than the variance in the data (Winkelmann, 2013).
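The contrast between the Poisson and NB variance assumptions can be checked numerically. The sketch below is only an illustration: it assumes the common "NB2" parameterization, in which the gamma-distributed Poisson mean has mean mu and variance alpha*mu^2, and the function names and parameter values are hypothetical rather than taken from the paper.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of count k with expected value mu."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def negbin_pmf(k, mu, alpha):
    """NB2 probability of count k (assumed parameterization).

    Equivalent to a Poisson whose mean is gamma distributed with
    mean mu and variance alpha * mu**2.
    """
    r = 1.0 / alpha          # gamma shape parameter
    p = r / (r + mu)         # NB "success" probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

def mean_and_variance(pmf, upper=500):
    """Mean and variance of a count pmf, summed over 0..upper-1."""
    probs = [pmf(k) for k in range(upper)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

if __name__ == "__main__":
    mu, alpha = 3.0, 0.5  # illustrative values only
    print(mean_and_variance(lambda k: poisson_pmf(k, mu)))
    print(mean_and_variance(lambda k: negbin_pmf(k, mu, alpha)))
```

Under this parameterization the NB variance is mu + alpha*mu^2, so it exceeds the mean for any alpha > 0 and approaches the Poisson variance as alpha shrinks toward zero.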

In some cases, deviations from probability mass assigned by the assumed discrete distribution (e.g., Poisson or NB) for specific count outcomes can be the primary reason for under- or over-dispersion in the data. The standard count models were modified in the past to deal with such situations. For instance, zero-inflated (ZI) and hurdle count models are generalizations of standard count models for handling over-representation of zeroes in count data (Gurmu, 1998, Lord et al., 2005). Specifically, the ZI and hurdle count models are obtained by assuming a sequential decision making process. In the context of telecommuting, the ZI model assumes that the decision maker first chooses whether to exercise her/his option to telecommute, and then in the second stage chooses the number of days to telecommute (including the choice to not telecommute). In the hurdle model, on the other hand, the decision maker first chooses whether to telecommute or not and then, in the second stage, chooses the number of non-zero days to telecommute. This two-stage decision making process also happens to fit the data better because these models provide additional flexibility to account for over-representation of zeroes in the dataset. However, similar behavioral interpretations for over-representation of non-zero count outcomes are not convincing (i.e., three-stage sequential decision making for over-representation of both zeroes and ones in the data or multi-stage decision making for over-representation of multiple count outcomes in the data). Moreover, even mathematically, estimating inflated and hurdle models with multiple stages of decision making is not easy because the resulting model structure is not parsimonious.
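The difference between the two sequential structures is easiest to see in their probability mass functions. The following is a minimal sketch assuming a Poisson second stage in both cases. The function names, the first-stage probabilities, and the parameter values are hypothetical and chosen only for illustration.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of count k with expected value mu."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def zero_inflated_pmf(k, mu, p_no_option):
    """Zero-inflated Poisson (assumed second stage).

    Stage 1: with probability p_no_option the worker does not exercise
    the telecommuting option at all. Stage 2: otherwise the count is
    Poisson, which can still produce a zero.
    """
    base = (1.0 - p_no_option) * poisson_pmf(k, mu)
    return p_no_option + base if k == 0 else base

def hurdle_pmf(k, mu, p_zero):
    """Hurdle Poisson (assumed second stage).

    Stage 1: zero with probability p_zero. Stage 2: a strictly positive
    count drawn from a zero-truncated Poisson.
    """
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(k, mu) / (1.0 - poisson_pmf(0, mu))

if __name__ == "__main__":
    mu, first_stage = 2.0, 0.3  # illustrative values only
    for k in range(4):
        print(k, zero_inflated_pmf(k, mu, first_stage),
              hurdle_pmf(k, mu, first_stage))
```

In the ZI form zeroes come from both stages, so the probability of zero exceeds the first-stage probability, whereas in the hurdle form the first stage alone determines the probability of zero.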

Recently, Castro et al. (2012) proposed a Generalized Ordered Response (GOR) framework that subsumes standard count models as special cases to handle scenarios in which several count outcomes can have deviations from the probability mass implied by the underlying count probability distribution. Also, the GOR framework is befitting for analyzing multivariate count data because correlations among multiple count variables can be captured easily through the propensity equation instead of using common mixing terms in the expected value specification of correlated count data, which is the norm (Narayanamoorthy et al., 2013). However, the GOR framework assumes a behavioral mechanism different from the Random Utility Maximization (RUM) principle (Bhat and Pulugurta, 1998). Specifically, while RUM models such as the multinomial logit (MNL) assume that latent utilities associated with different choice alternatives are translated into observed choice based on the utility maximization rule, the GOR models assume that a single latent propensity is translated into observed outcomes based on its value relative to threshold parameters. This is not to suggest that the GOR models are inconsistent with the RUM principle. In fact, Small (1984) showed that the GOR models can be recast as special cases of the MNL model with a specific form of non-linear utility specification. However, even OR models recast as MNL models assume a restrictive correlation structure for the underlying utilities. In cases when observed count data is an outcome of repeated discrete choices in a panel setting, researchers in the past linked count models to random utility models by using the maximum utility from the discrete choice model as an explanatory variable in the expected value specification of the count model (Burda et al., 2009, Bhat et al., 2015). For instance, worker daily out-of-home non-mandatory activity frequency can be viewed as an outcome of repeated activity-type choice decisions in a day. So, the maximum utility from the lower-level discrete choice model for activity-type choice can be used as an explanatory variable in the expected value specification of the total activity frequency model. While these models capture the linkages between the count model and the underlying discrete choices better, they retain the parametric discrete probability distribution assumption for the count model component.

Other attempts to link count models to the utility framework have assumed a hierarchical decision making structure with an indefinite number of choice occasions (Daly, 1997, Daly and Miller, 2006, Ortuzar and Willumsen, 2011). For instance, in the context of telecommuting, each worker is assumed to decide whether to telecommute more days or stop at each hierarchical level: Level 1: 0 or 1+ days; Level 2: 1 or 2+ days; Level 3: 2 or 3+ days; and so on. This model is referred to as the ‘frequency choice logit’ model and was shown to collapse to the geometric count model when the choice model for the first level (i.e., 0 or 1+ days) is assumed to be the same as the choice model for all subsequent levels. Daly and Miller (2006) conclude their study by noting that such links to utility theory for other count models such as the Poisson model are yet to be established.
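The collapse to the geometric model can be illustrated with a small sketch. It assumes, purely for illustration, a binary logit at each level in which stopping has utility zero and continuing has utility V. The function names and utility values are hypothetical and not taken from the cited studies.

```python
import math

def stop_probability(continue_utility):
    """Binary logit probability of stopping (stop utility fixed at 0)."""
    return 1.0 / (1.0 + math.exp(continue_utility))

def frequency_choice_probs(continue_utility_at, max_count):
    """Probabilities of counts 0..max_count under stop/continue decisions.

    continue_utility_at(days_so_far) returns the utility of continuing
    when days_so_far days have already been chosen.
    """
    probs = []
    reach = 1.0  # probability of reaching the current level
    for days_so_far in range(max_count + 1):
        stop = stop_probability(continue_utility_at(days_so_far))
        probs.append(reach * stop)
        reach *= 1.0 - stop
    return probs

def geometric_pmf(k, stop):
    """Geometric probability of k continuations before the first stop."""
    return stop * (1.0 - stop) ** k

if __name__ == "__main__":
    v = 0.4  # illustrative utility of continuing, same at every level
    same_levels = frequency_choice_probs(lambda days: v, 5)
    print(same_levels)
    print([geometric_pmf(k, stop_probability(v)) for k in range(6)])
```

When the utility of continuing differs at the first level (for example, `lambda days: -1.0 if days == 0 else 0.4`), the probability of zero no longer matches the geometric form, which is the extra flexibility the hierarchical structure provides.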

The objectives of the current study are three-fold: (1) develop Generalized Extreme Value (GEV) models that are consistent with the RUM framework (McFadden, 1978) and subsume standard count models as special cases, including the frequency choice logit model developed by Daly and colleagues; (2) examine the ability of the Maximum Likelihood (ML) inference approach to retrieve the parameters of the proposed GEV models without bias using synthetic data; and (3) demonstrate the applicability of the GEV count models developed in this study in an empirical context of considerable importance to the transportation choice modeling community, namely worker telecommuting frequency decisions. Recasting count models as GEV models not only provides additional flexibility for modeling count data but also offers considerable behavioral advantages. For instance, log-sum measures from count models can be computed and used as measures of consumer surplus (Kohli and Daly, 2006, de Jong et al., 2007). Also, the log-sum from a count model can be used as an explanatory variable in the sequential estimation of multi-dimensional models of inter-dependent choices (Bowman and Ben-Akiva, 2001, Yao and Morikawa, 2005).

Section snippets

Methodological framework

There are primarily five different distributions that are used in the literature for modeling count data, namely Poisson, Geometric, Negative Binomial (NB), Binomial, and Logarithmic. This section demonstrates how each of these five models can be recast as special cases of the simplest GEV model, i.e., the multinomial logit (MNL) model. Before we proceed, please note that in the MNL model, the probability that alternative i with observed utility V_i is chosen from a set of J mutually exhaustive
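As a rough illustration of the recasting idea for the Poisson case, the sketch below assigns count k the utility V_k = k ln(lambda) - ln(k!) and applies the standard MNL formula over a long, finite list of counts. This is a well-known identity, since exp(V_k) = lambda^k / k! and the sum of these terms over all k is exp(lambda). It is shown as an assumption-laden sketch and not necessarily the paper's exact specification; the function names are hypothetical.

```python
import math

def count_utilities(lam, max_count):
    """Assumed utility for count k: k*ln(lam) - ln(k!)."""
    return [k * math.log(lam) - math.lgamma(k + 1)
            for k in range(max_count + 1)]

def mnl_probabilities(utilities):
    """Standard MNL choice probabilities, computed stably."""
    vmax = max(utilities)
    weights = [math.exp(v - vmax) for v in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def poisson_pmf(k, lam):
    """Poisson probability of count k with expected value lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

if __name__ == "__main__":
    lam = 2.5  # illustrative expected count
    probs = mnl_probabilities(count_utilities(lam, 60))
    for k in range(5):
        print(k, probs[k], poisson_pmf(k, lam))
```

With 61 alternatives the omitted tail is negligible for this value of lambda, so the MNL probabilities agree with the Poisson probabilities to numerical precision.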

Simulation analysis

The GEV versions of count models proposed in this study deal with infinite choice sets of ordinal outcomes. However, during model estimation, the choice set must be truncated at a pre-determined maximum count value. So, this is equivalent to misspecification of the choice set. The maximum likelihood (ML) inference approach may not produce consistent and unbiased estimates under misspecification (White, 1982). So, the bias and consistency of the OGEV count model parameter estimates obtained
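The effect of truncating the choice set can be sketched at the level of probabilities. The example below assumes a Poisson-type utility V_k = k ln(lambda) - ln(k!) and hypothetical function names. It shows only how the truncated logit probabilities deviate from the untruncated Poisson probabilities, not the bias of any estimator, which the paper examines by simulation.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of count k with expected value lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def truncated_count_probs(lam, max_count):
    """Logit probabilities over counts 0..max_count only."""
    log_w = [k * math.log(lam) - math.lgamma(k + 1)
             for k in range(max_count + 1)]
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]

def max_abs_error(lam, max_count):
    """Largest gap between truncated and untruncated probabilities."""
    probs = truncated_count_probs(lam, max_count)
    return max(abs(p - poisson_pmf(k, lam)) for k, p in enumerate(probs))

if __name__ == "__main__":
    lam = 4.0  # illustrative expected count
    for max_count in (5, 10, 20, 40):
        print(max_count, max_abs_error(lam, max_count))
```

The gap shrinks quickly as the maximum count grows, which suggests that a cutoff well above the largest plausible count leaves little probability mass outside the choice set.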

Empirical application

With growing congestion levels and increasing budgetary constraints for new infrastructure projects, travel demand management (TDM) strategies that lead to efficient use of available network capacity are becoming the norm for improving travel conditions. Within the TDM strategies, those that focus on commuters such as telecommuting, staggered and flexible work hours, and parking subsidies are considered more effective because they directly impact peak period traffic. So, it would be useful to

Conclusions

Traditional count data models such as the Poisson and Negative Binomial (NB) models that assume a discrete probability distribution for count outcomes are commonly used in the literature. While generalizations that can handle over- or under-representation of specific count outcomes (e.g., zero-inflated or hurdle count models) were developed in the past, extending these methods for cases with over- or under-representation of multiple count outcomes can result in complex model structures that may

Acknowledgments

The author would like to thank two anonymous reviewers whose comments helped improve an earlier version of the paper considerably.

References (34)

Ahmed, M. et al. (2011). Exploring a Bayesian hierarchical approach for developing safety performance functions for a mountainous freeway. Accident Analysis & Prevention.
Bhat, C.R. et al. (1998). A comparison of two alternative behavioral choice mechanisms for household auto ownership decisions. Transportation Research Part B.
Bowman, J.L. et al. (2001). Activity-based disaggregate travel demand model system with activity schedules. Transportation Research Part A.
Castro, M. et al. (2012). A latent variable representation of count data models to accommodate spatial and temporal dependence: Application to predicting crash frequency at intersections. Transportation Research Part B.
de Jong, G. et al. (2007). The logsum as an evaluation measure: Review of the literature and new results. Transportation Research Part A.
Fosgerau, M. et al. (2009). Discrete choice models with multiplicative error terms. Transportation Research Part B.
Greene, W. (2008). Functional forms for the negative binomial model for count data. Economics Letters.
Gurmu, S. (1998). Generalized hurdle count data regression models. Economics Letters.
Hausman, J.A. et al. (1998). Misclassification of the dependent variable in a discrete-response setting. Journal of Econometrics.
Hausman, J.A. et al. (1995). A utility-consistent, combined discrete choice and count data model assessing recreational use losses due to natural resource damage. Journal of Public Economics.
Lord, D. et al. (2005). Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accident Analysis & Prevention.
Narayanamoorthy, S. et al. (2013). On accommodating spatial dependence in bicycle and pedestrian injury counts by severity level. Transportation Research Part B.
Smith, M.S. et al. (2011). Bicycle commuting in Melbourne during the 2000s energy crisis: A semiparametric analysis of intraday volumes. Transportation Research Part B.
Yao, E. et al. (2005). A study of on integrated intercity travel demand model. Transportation Research Part A.
Asgari, H. et al. (2014). Choice, frequency, and engagement: framework for telecommuting behavior analysis and modeling. Transportation Research Record: Journal of the Transportation Research Board.
Bernardino, A. et al. (1996). Modeling the process of adoption of telecommuting: comprehensive framework. Transportation Research Record: Journal of the Transportation Research Board.
Bhat, C.R. et al. (2015). A new utility-consistent econometric approach to multivariate count data modeling. Journal of Applied Econometrics.

