When is big data big enough? Implications of using GPS-based surveys for travel demand analysis

https://doi.org/10.1016/j.trc.2015.04.025

Highlights

  • We assess the implications of using GPS-based surveys for travel demand analysis.

  • Multiple simulated datasets and real data are used.

  • Errors in inference may offset gains in the volume of data.

  • Benefits will vary with sample size, inference accuracy and model complexity.

Abstract

A number of studies in the last decade have argued that Global Positioning System (GPS) based surveys offer the potential to replace traditional travel diary surveys. GPS-based surveys impose lower respondent burden, offer greater spatiotemporal precision and incur fewer monetary costs. However, GPS-based surveys do not collect certain key inputs required for the estimation of travel demand models, such as the travel mode(s) taken or the trip purpose, relying instead on data-processing procedures to infer this information. This study assesses the impact that errors in inference can have on travel demand models estimated using data from GPS-based surveys and proposes ways in which these errors can be controlled for during both data collection and model estimation. We use simulated datasets to compare performance across different sample sizes, inference accuracies, model complexities and estimation methods. Findings from the simulated datasets are corroborated with real data collected from individuals living in the San Francisco Bay Area, United States. Results indicate that the benefits of using GPS-based surveys will vary significantly, depending upon the sample size of the data, the accuracy of the inference algorithm and the desired complexity of the travel demand model specification. In many cases, gains in the volume of data that can potentially be retrieved using GPS devices are found to be offset by the loss in quality caused by inaccuracies in inference. This study makes the argument that passively collected GPS-based surveys may never entirely replace surveys that require active interaction with study participants.

Introduction

Traditional models of individual and household travel and activity behavior are estimated using travel diary datasets that ask a small subset of the population of interest to record over a period of one or two days which activities were conducted where, when, for how long, with whom and using what mode of travel. For example, the New York Best Practices Model (NYBPM), the activity-based model of travel demand developed for the New York Metropolitan Region, was estimated using travel diary data from about 28,000 individuals collected over a period of one day (Parsons Brinckerhoff Quade & Douglas, Inc., 2005). The size of that and other similar travel diary datasets pales in comparison to the volume of information that can potentially be retrieved from new technologies, such as Global Positioning System (GPS) sensors and smartphones, and social media platforms, such as Twitter and Facebook, both now and in the future.

Advances in GPS technologies in particular have received substantive attention in the last decade (for an exhaustive review of the literature, the reader is referred to Zmud et al. (2013), Shen and Stopher (2014) and Wolf et al. (2014)). Early applications sought to supplement extant methods of travel diary data collection that rely on self-reporting, such as mail-back, phone-based or door-to-door travel diary surveys, through their ability to control for factors such as trip underreporting (Pierce et al., 2003, Wolf et al., 2003, Zmud and Wolf, 2003, Bricka and Bhat, 2007, Bachman et al., 2012). A number of regional household travel surveys have since adopted GPS-based surveys as a subcomponent to improve the accuracy and completeness of data collected using these more traditional methods. Recent examples include the Atlanta Regional Travel Survey, conducted in 2011, and the California Household Travel Survey, conducted in 2012 and 2013.

However, the long-term objective has always been the development of GPS-based surveys that can collect all the information that is usually collected by traditional travel diary surveys, but with very little input from survey participants. Thus far, there have been three large-scale travel surveys that have relied entirely on GPS sensors for the collection of travel diary data. The Ohio Department of Transportation initiated the first exclusively GPS household travel survey, conducted in the Cincinnati metropolitan area over 2009 and 2010 with a sample of 2583 households (Stopher et al., 2012), and followed it with the Northeast Ohio Regional Travel Survey conducted over 2012 and 2013 in the Cleveland metropolitan area with a sample of 4545 households. The regional planning agency in Jerusalem administered a similar exclusively GPS household travel survey over 2010 and 2011 with a sample of 8800 households. As the technology matures, GPS-based surveys are expected eventually to replace traditional travel diary surveys (Wolf et al., 2001, Stopher et al., 2008).

The advantages of using GPS-based surveys are manifold. They impose fewer requirements on survey respondents, offer greater spatiotemporal precision and are potentially cheaper to implement. Nevertheless, as pointed out by Shen and Stopher (2014), GPS-based surveys “cannot record travel mode, trip purpose or the number of occupants in a private vehicle—all important attributes in a traditional travel survey. Therefore, data-processing procedures become critical to the usefulness of GPS surveys, because there would be insufficient information for travel modeling purposes without the results of the processing.” Existing GPS-based surveys ask a subset of the sample population to participate in prompted-recall surveys where the respondents are asked to confirm trip details. The data thus collected is used to infer the same trip details for the remainder of the sample population. Numerous algorithms have been proposed for inferring one or more of these missing details from the GPS data, augmented in many cases with additional sources, such as accelerometer readings from smartphones (e.g. Feng and Timmermans, 2013), land use characteristics from Geographic Information Systems (GIS) databases (e.g. Bohte and Maat, 2009) or ‘check-ins’ from social media applications (e.g. Hasan and Ukkusuri, 2014). However, even the most successful inference algorithm will have some error associated with it. Most published studies in the literature report average accuracies of 60–90%. For example, the best practices travel mode inference model recommended by the National Cooperative Highway Research Program (NCHRP) Report 775, titled ‘Applying GPS Data to Understand Travel Behavior,’ has an average accuracy of 82% (Wolf et al., 2014). Errors in inference can potentially compromise the quality of data collected through GPS-based surveys and the validity of travel demand models developed using this data. 
And yet, to the best of our knowledge, no study has systematically examined the implications of using low-quality big data for traditional modes of analysis.

The objectives of this study are twofold: (1) to evaluate the impact of errors in inference on travel demand models estimated with GPS data using simulated and real datasets; and (2) to develop ways in which these errors can be controlled for through data collection methods and estimation procedures. To be clear, we do not wish to compare the performance of different inference algorithms. There are numerous studies in the literature that have already done so (refer to Shen and Stopher (2014) or Wolf et al. (2014) for a comprehensive list). Rather, our objective is to determine, for a given inference algorithm with a measured level of accuracy and precision that is used to impute missing information from GPS-based surveys, the validity of travel demand models estimated using this imputed data.

The paper is structured as follows: Section 2 describes a series of Monte Carlo experiments that compare model performance across different sample sizes, inference accuracies, model complexities and estimation methods; Section 3 uses actual GPS data collected from individuals residing in the San Francisco Bay Area, United States to corroborate findings from the Monte Carlo experiments; and Section 4 concludes the paper with a summary of findings and implications.
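To make the idea of such a Monte Carlo experiment concrete, the following is a minimal sketch, not the design used in the paper. It assumes a binary mode choice governed by a logit model with one hypothetical attribute, and it mimics inference error by flipping each true label with probability one minus the stated accuracy. The function names, the parameter values in `TRUE_BETA`, and the sample sizes and accuracies in the loop are all illustrative assumptions.

```python
import numpy as np

# Hypothetical "true" parameters: intercept and one attribute coefficient.
TRUE_BETA = np.array([0.5, -1.0])


def simulate_choices(n, beta, accuracy, rng):
    """Draw binary choices from a logit model, then corrupt the labels."""
    x = rng.normal(size=n)                    # hypothetical attribute, e.g. a travel time difference
    X = np.column_stack([np.ones(n), x])      # design matrix: intercept + attribute
    p = 1.0 / (1.0 + np.exp(-X @ beta))       # true choice probabilities
    y_true = (rng.random(n) < p).astype(float)
    flip = rng.random(n) >= accuracy          # label is wrong with probability 1 - accuracy
    y_obs = np.where(flip, 1.0 - y_true, y_true)
    return X, y_obs


def fit_logit(X, y, n_iter=50):
    """Maximum likelihood estimate of a binary logit via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                          # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X   # negative Hessian
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta


def mean_estimate(n, accuracy, beta, n_reps=50, seed=0):
    """Average the parameter estimates over repeated simulated datasets."""
    rng = np.random.default_rng(seed)
    estimates = [
        fit_logit(*simulate_choices(n, beta, accuracy, rng))
        for _ in range(n_reps)
    ]
    return np.mean(estimates, axis=0)


if __name__ == "__main__":
    # Compare recovered parameters across sample sizes and inference accuracies.
    for n in (500, 5000):
        for acc in (1.0, 0.9, 0.8):
            print(n, acc, mean_estimate(n, acc, TRUE_BETA))
```

Because the true parameters are known, the printed averages can be compared directly with `TRUE_BETA`. In this simplified setting the attribute coefficient would be expected to shrink towards zero as accuracy falls, and a larger sample does not by itself remove that bias, which is in line with the paper's argument that volume may not compensate for inference error.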

Section snippets

Monte Carlo experiment

In this section, we simulate a series of three Monte Carlo experiments to assess the impact of inference errors on estimation results as a function of the accuracy of the inference algorithm, the complexity of the desired travel demand model specification, the sample size of the observed data and the estimation method used to recover parameter estimates. A Monte Carlo experiment is especially useful because the true parameters underlying the data generating process are known, and the impact of

Case study: GPS-based survey in the San Francisco Bay Area, United States

In this section, we corroborate findings from Section 2 using real data collected from smartphone users living in the San Francisco Bay Area, United States through the means of an app called E-Mission. The app is being developed by a team of researchers at the University of California (UC), Berkeley. One of the objectives of E-Mission is to collect all the information that is usually collected by travel diary surveys, but with minimal input from the smartphone user. For more details about the

Conclusions

The last few years have been witness to great excitement over big data and its potential to address a multitude of societal problems, within transportation engineering and without, on an unprecedented scale and level of detail. The National Science Foundation’s call for research proposals in 2014 on “Critical Techniques and Technologies for Advancing Big Data Science and Engineering”, the Transportation Research Board’s call for papers on “Big Data, ICTs, and Travel Demand Models” for its 93rd

Acknowledgements

This research was funded in part by the Jim Gray Fellowship and in part by the NSF ActionWebs CPS-0931843. Jim Gray, who did pioneering work on the management of large amounts of data, disappeared while sailing in the San Francisco Bay in 2007. We hope that he would have found this exploration into data sizes and accuracies interesting. We wish to thank Mogeng Yin, Shanthi Shanmugam and Ryan Lei for their help in developing E-Mission, the smartphone app used by this study for data collection.
