When is big data big enough? Implications of using GPS-based surveys for travel demand analysis
Introduction
Traditional models of individual and household travel and activity behavior are estimated using travel diary datasets that ask a small subset of the population of interest to record, over a period of one or two days, which activities were conducted where, when, for how long, with whom and using what mode of travel. For example, the New York Best Practices Model (NYBPM), the activity-based model of travel demand developed for the New York Metropolitan Region, was estimated using travel diary data from about 28,000 individuals collected over a period of one day (Parsons Brinkerhoff Quade & Douglas, Inc., 2005). The size of that and other similar travel diary datasets pales in comparison to the volume of information that can potentially be retrieved, both now and in the future, from new technologies, such as Global Positioning System (GPS) sensors and smartphones, and from social media platforms, such as Twitter and Facebook.
Advances in GPS technologies in particular have received substantial attention in the last decade (for exhaustive reviews of the literature, the reader is referred to Zmud et al. (2013), Shen and Stopher (2014) and Wolf et al. (2014)). Early applications sought to supplement extant methods of travel diary data collection that rely on self-reporting, such as mail-back, phone-based or door-to-door travel diary surveys, by controlling for factors such as trip underreporting (Pierce et al., 2003; Wolf et al., 2003; Zmud and Wolf, 2003; Bricka and Bhat, 2007; Bachman et al., 2012). A number of regional household travel surveys have since adopted GPS-based surveys as a subcomponent to improve the accuracy and completeness of data collected using these more traditional methods. Recent examples include the Atlanta Regional Travel Survey, conducted in 2011, and the California Household Travel Survey, conducted in 2012 and 2013.
However, the long-term objective has always been the development of GPS-based surveys that can collect all the information that is usually collected by traditional travel diary surveys, but with very little input from survey participants. Thus far, there have been three large-scale travel surveys that have relied entirely on GPS sensors for the collection of travel diary data. The Ohio Department of Transportation initiated the first exclusively GPS household travel survey, conducted in the Cincinnati metropolitan area over 2009 and 2010 with a sample of 2583 households (Stopher et al., 2012), and followed it with the Northeast Ohio Regional Travel Survey conducted over 2012 and 2013 in the Cleveland metropolitan area with a sample of 4545 households. The regional planning agency in Jerusalem administered a similar exclusively GPS household travel survey over 2010 and 2011 with a sample of 8800 households. As the technology matures, GPS-based surveys are expected eventually to replace traditional travel diary surveys (Wolf et al., 2001; Stopher et al., 2008).
The advantages of using GPS-based surveys are manifold. They impose fewer requirements on survey respondents, offer greater spatiotemporal precision and are potentially cheaper to implement. Nevertheless, as pointed out by Shen and Stopher (2014), GPS-based surveys “cannot record travel mode, trip purpose or the number of occupants in a private vehicle—all important attributes in a traditional travel survey. Therefore, data-processing procedures become critical to the usefulness of GPS surveys, because there would be insufficient information for travel modeling purposes without the results of the processing.” Existing GPS-based surveys ask a subset of the sample population to participate in prompted-recall surveys where the respondents are asked to confirm trip details. The data thus collected is used to infer the same trip details for the remainder of the sample population. Numerous algorithms have been proposed for inferring one or more of these missing details from the GPS data, augmented in many cases with additional sources, such as accelerometer readings from smartphones (e.g. Feng and Timmermans, 2013), land use characteristics from Geographic Information Systems (GIS) databases (e.g. Bohte and Maat, 2009) or ‘check-ins’ from social media applications (e.g. Hasan and Ukkusuri, 2014). However, even the most successful inference algorithm will have some error associated with it. Most published studies in the literature report average accuracies of 60–90%. For example, the best practices travel mode inference model recommended by the National Cooperative Highway Research Program (NCHRP) Report 775, titled ‘Applying GPS Data to Understand Travel Behavior,’ has an average accuracy of 82% (Wolf et al., 2014). Errors in inference can potentially compromise the quality of data collected through GPS-based surveys and the validity of travel demand models developed using this data. 
And yet, to the best of our knowledge, no study has systematically examined the implications of using low-quality big data for traditional modes of analysis.
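The imputation step described above, in which details confirmed by a prompted-recall subsample are used to infer the same details for the remaining respondents, can be sketched as follows. This is a minimal illustration and not any of the algorithms cited: the nearest-centroid rule on mean trip speed, the function names and the per-mode speed distributions are all assumptions made for the example, and the data are synthetic so that the true mode of every trip is known.

```python
import random


def mean(values):
    return sum(values) / len(values)


def fit_centroids(speeds, modes):
    """Mean speed per confirmed mode, learned from the prompted-recall subsample."""
    grouped = {}
    for speed, mode in zip(speeds, modes):
        grouped.setdefault(mode, []).append(speed)
    return {mode: mean(vals) for mode, vals in grouped.items()}


def infer_mode(speed, centroids):
    """Assign the mode whose mean speed is closest to the observed speed."""
    return min(centroids, key=lambda mode: abs(speed - centroids[mode]))


def make_trips(n, rng):
    """Synthetic trips as (mean speed in km/h, true mode) pairs.

    The speed distributions are hypothetical and chosen only for illustration.
    """
    params = {"walk": (5.0, 1.5), "bike": (15.0, 4.0), "car": (35.0, 12.0)}
    trips = []
    for _ in range(n):
        mode = rng.choice(sorted(params))
        mu, sigma = params[mode]
        trips.append((max(0.5, rng.gauss(mu, sigma)), mode))
    return trips


def imputation_accuracy(n_labelled=300, n_unlabelled=3000, seed=0):
    """Share of unlabelled trips whose imputed mode matches the true mode."""
    rng = random.Random(seed)
    labelled = make_trips(n_labelled, rng)    # prompted-recall subsample
    remainder = make_trips(n_unlabelled, rng)  # modes must be imputed
    centroids = fit_centroids([s for s, _ in labelled], [m for _, m in labelled])
    hits = sum(infer_mode(s, centroids) == m for s, m in remainder)
    return hits / len(remainder)


if __name__ == "__main__":
    print(f"imputation accuracy: {imputation_accuracy():.3f}")
```

In a real survey the true mode of the unlabelled trips is not observed, so accuracy would have to be estimated on a held-out part of the prompted-recall subsample instead.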
The objectives of this study are twofold: (1) to evaluate the impact of errors in inference on travel demand models estimated with GPS data using simulated and real datasets; and (2) to develop ways in which these errors can be controlled for through data collection methods and estimation procedures. To be clear, we do not wish to compare the performance of different inference algorithms. There are numerous studies in the literature that have already done so (refer to Shen and Stopher (2014) or Wolf et al. (2014) for a comprehensive list). Rather, our objective is to determine, for a given inference algorithm with a measured level of accuracy and precision that is used to impute missing information from GPS-based surveys, the validity of travel demand models estimated using this imputed data.
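The first objective can be illustrated with a small simulation. The sketch below is not the experimental design used in this paper; it assumes a binary logit model with a single explanatory variable, an inference algorithm that mislabels the chosen alternative symmetrically with probability one minus its accuracy, and hypothetical parameter values and function names. It estimates the model once on the true choices and once on the inferred choices so that the two slope estimates can be compared.

```python
import math
import random


def simulate_choices(n, beta0, beta1, accuracy, rng):
    """Draw (x, true choice, inferred choice) triples from a binary logit model."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        true_y = 1 if rng.random() < p else 0
        # Assumption: the inference algorithm flips the label with
        # probability (1 - accuracy), independently of x and of the choice.
        inferred_y = true_y if rng.random() < accuracy else 1 - true_y
        data.append((x, true_y, inferred_y))
    return data


def fit_logit(xs, ys, iterations=25):
    """Maximum likelihood estimates (intercept, slope) via Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # negative Hessian (information matrix)
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def run_experiment(n=5000, accuracy=0.8, seed=1):
    """Return the slope estimated from true labels and from inferred labels."""
    rng = random.Random(seed)
    data = simulate_choices(n, beta0=0.0, beta1=1.0, accuracy=accuracy, rng=rng)
    xs = [d[0] for d in data]
    _, slope_true = fit_logit(xs, [d[1] for d in data])
    _, slope_inferred = fit_logit(xs, [d[2] for d in data])
    return slope_true, slope_inferred


if __name__ == "__main__":
    slope_true, slope_inferred = run_experiment()
    print(f"slope from true labels:     {slope_true:.3f}")
    print(f"slope from inferred labels: {slope_inferred:.3f}")
```

Under these assumptions the slope estimated from the inferred labels is pulled toward zero relative to the slope estimated from the true labels, because the mislabelled observations weaken the apparent relationship between the explanatory variable and the choice.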
The paper is structured as follows: Section 2 describes a series of Monte Carlo experiments that compare model performance across different sample sizes, inference accuracies, model complexities and estimation methods; Section 3 uses actual GPS data collected from individuals residing in the San Francisco Bay Area, United States to corroborate findings from the Monte Carlo experiments; and Section 4 concludes the paper with a summary of findings and implications.
Section snippets
Monte Carlo experiment
In this section, we simulate a series of three Monte Carlo experiments to assess the impact of inference errors on estimation results as a function of the accuracy of the inference algorithm, the complexity of the desired travel demand model specification, the sample size of the observed data and the estimation method used to recover parameter estimates. A Monte Carlo experiment is especially useful because the true parameters underlying the data generating process are known, and the impact of
Case study: GPS-based survey in the San Francisco Bay Area, United States
In this section, we corroborate findings from Section 2 using real data collected from smartphone users living in the San Francisco Bay Area, United States by means of an app called E-Mission. The app is being developed by a team of researchers at the University of California (UC), Berkeley. One of the objectives of E-Mission is to collect all the information that is usually collected by travel diary surveys, but with minimal input from the smartphone user. For more details about the
Conclusions
The last few years have been witness to great excitement over big data and its potential to address a multitude of societal problems, within transportation engineering and without, on an unprecedented scale and level of detail. The National Science Foundation’s call for research proposals in 2014 on “Critical Techniques and Technologies for Advancing Big Data Science and Engineering”, the Transportation Research Board’s call for papers on “Big Data, ICTs, and Travel Demand Models” for its 93rd
Acknowledgements
This research was funded in part by the Jim Gray Fellowship and in part by the NSF ActionWebs CPS-0931843. Jim Gray, who did pioneering work on the management of large amounts of data, disappeared while sailing in the San Francisco Bay in 2007. We hope that he would have found this exploration into data sizes and accuracies interesting. We wish to thank Mogeng Yin, Shanthi Shanmugam and Ryan Lei for their help in developing E-Mission, the smartphone app used by this study for data collection.
References (30)
- et al. (2009). Deriving and validating trip purposes and travel modes for multi-day GPS-based travel surveys: a large-scale application in the Netherlands. Transport. Res. Part C: Emerg. Technol.
- et al. (2012). Inferring hybrid transportation modes from sparse GPS data using a moving window SVM classification. Comput. Environ. Urban Syst.
- et al. (2013). Transportation mode recognition using GPS and accelerometer data. Transport. Res. Part C: Emerg. Technol.
- et al. (2014). Urban activity pattern classification using topic models from online geo-location data. Transport. Res. Part C: Emerg. Technol.
- et al. (2009). Modeling the structural relationships among short-distance travel amounts, perceptions, affections, and desires. Transport. Res. Part A: Policy Pract.
- et al. (2008). Search for a global positioning system device to measure person travel. Transport. Res. Part C: Emerg. Technol.
- et al. (1982). Behavioural theories of dispersion and the misspecification of travel demand models. Transport. Res. Part B: Methodol.
- Bachman, W., Oliveira, M., Xu, J., Sabina, E., 2012. Using household-level GPS travel data to measure regional traffic...
- et al. (1985)
- et al. (2007). Comparative analysis of Global Positioning System-based and travel survey-based data. Transport. Res. Rec.: J. Transport. Res. Board
- Economic choices. Am. Econ. Rev.