Synthetic Data via Quantile Regression for Heavy-Tailed and Heteroskedastic Data

  • Conference paper
Privacy in Statistical Databases (PSD 2018)

Abstract

Privacy protection of confidential data is a fundamental problem faced by many government organizations and research centers. It is further complicated when data have complex structures or variables with highly skewed distributions. The statistical community addresses general privacy concerns by introducing different techniques that aim to decrease disclosure risk in released data while retaining their statistical properties. However, methods for complex data structures have received insufficient attention. We propose producing synthetic data via quantile regression to address privacy protection of heavy-tailed and heteroskedastic data. We address some shortcomings of the previously proposed use of quantile regression as a synthesis method and extend the work into cases where data have heavy tails or heteroskedastic errors. Using a simulation study and two applications, we show that there are settings where quantile regression performs as well as or better than other commonly used synthesis methods on the basis of maintaining good data utility while simultaneously decreasing disclosure risk.
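To make the synthesis idea concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes one simple variant of the approach: fit linear quantile regressions of a confidential variable on a predictor at a grid of quantile levels, then, for each record, draw a level at random and release the fitted conditional quantile instead of the confidential value. The function names, the quantile grid, the brute-force fitting routine, and the simulated heteroskedastic data are all illustrative assumptions.

```python
import random


def check_loss(residual, tau):
    """Check (pinball) loss minimized by quantile regression at level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fit_quantile_line(xs, ys, tau):
    """Fit y = intercept + slope * x at quantile level tau by brute force.

    With a single predictor, some optimal line passes through two data
    points, so scoring every such candidate line finds a minimizer.
    This is O(n^3) and only meant for small illustrative samples.
    """
    best = None
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if xs[i] == xs[j]:
                continue  # a vertical line is not a valid regression line
            slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
            intercept = ys[i] - slope * xs[i]
            loss = sum(
                check_loss(y - (intercept + slope * x), tau)
                for x, y in zip(xs, ys)
            )
            if best is None or loss < best[0]:
                best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


def synthesize(xs, ys, taus, seed=0):
    """Replace each y by a fitted conditional quantile at a random level."""
    rng = random.Random(seed)
    fits = {tau: fit_quantile_line(xs, ys, tau) for tau in taus}
    synthetic = []
    for x in xs:
        tau = rng.choice(taus)  # hypothetical choice: uniform over the grid
        intercept, slope = fits[tau]
        synthetic.append(intercept + slope * x)
    return synthetic


# Simulated heteroskedastic data: the noise scale grows with x.
data_rng = random.Random(42)
xs = [data_rng.uniform(1, 10) for _ in range(30)]
ys = [1 + 2 * x + x * data_rng.gauss(0, 1) for x in xs]

taus = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
synthetic_ys = synthesize(xs, ys, taus)
print(list(zip(xs, synthetic_ys))[:3])
```

Because each quantile level gets its own slope, the spread of the synthetic values can widen with the predictor, which is the feature that a single mean regression with constant-variance noise would miss. A production implementation would use a linear-programming solver rather than this brute-force search, and would need to handle crossing quantile curves.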

Notes

  1. All code is available in our GitHub repository at https://github.com/labordynamicsinstitute/replication_qr_synthetic.

  2. We are discussing applying the methodology to the confidential data in order to generate a new release of synthetic data.

  3. Note that neither geography nor the full SIC code was used in the quantile regression synthesis.

Acknowledgments

The Synthetic LBD data were accessed through the Synthetic Data Server at Cornell University, which is funded through NSF Grants SES-1042181 and BCS-0941226 and a grant from the Alfred P. Sloan Foundation. Access to the Synthetic LBD is described at https://www2.vrdc.cornell.edu/news/synthetic-data-server/step-1-requesting-access-to-sds/. Use of the Integrated Census Microdata Project at the University of Essex was facilitated by Gillian Raab.

Author information

Corresponding author

Correspondence to Michelle Pistner.

A Supplementary Materials for 1901 Census of Scotland

There were a total of 82,851 observations in our extract. Of these observations, 20,303 were female and 62,548 were male. Additional statistics follow (Tables 4, 5 and Fig. 3).

Table 4. Summary statistics for continuous variables for extract of 1901 Census of Scotland. Note that the count variables had very heavy tails.
Table 5. Count of records by marital status.
Fig. 3. Density for the number of servants. Note: Most observations have a value of zero.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Pistner, M., Slavković, A., Vilhuber, L. (2018). Synthetic Data via Quantile Regression for Heavy-Tailed and Heteroskedastic Data. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_7

  • DOI: https://doi.org/10.1007/978-3-319-99771-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99770-4

  • Online ISBN: 978-3-319-99771-1

  • eBook Packages: Computer Science, Computer Science (R0)
