Abstract
Privacy protection of confidential data is a fundamental problem faced by many government organizations and research centers. It is further complicated when data have complex structures or variables with highly skewed distributions. The statistical community addresses general privacy concerns by introducing different techniques that aim to decrease disclosure risk in released data while retaining their statistical properties. However, methods for complex data structures have received insufficient attention. We propose producing synthetic data via quantile regression to address privacy protection of heavy-tailed and heteroskedastic data. We address some shortcomings of the previously proposed use of quantile regression as a synthesis method and extend the work into cases where data have heavy tails or heteroskedastic errors. Using a simulation study and two applications, we show that there are settings where quantile regression performs as well as or better than other commonly used synthesis methods on the basis of maintaining good data utility while simultaneously decreasing disclosure risk.
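The general idea of synthesis via quantile regression can be illustrated with a small sketch. This is not the authors' implementation (their code is in the repository listed in the Notes). It is a toy version under stated assumptions: the only covariate is a hypothetical binary group label, so a saturated quantile regression reduces to per-group quantiles, and the function names `pinball_loss`, `fit_quantile`, and `synthesize` are invented for illustration. Each confidential value is replaced by the fitted conditional quantile at a randomly drawn quantile level, which is inverse-transform sampling from the estimated conditional distribution.

```python
import random


def pinball_loss(residual, tau):
    """Check loss rho_tau(r) = r * (tau - 1[r < 0]) minimized by quantile regression."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fit_quantile(ys, tau):
    """Intercept-only quantile fit: a minimizer of the summed check loss
    always exists among the observed values, so search over those."""
    candidates = sorted(set(ys))
    return min(candidates, key=lambda q: sum(pinball_loss(y - q, tau) for y in ys))


def synthesize(groups, ys, taus, rng):
    """Replace each y with a fitted conditional quantile at a random level."""
    by_group = {}
    for g, y in zip(groups, ys):
        by_group.setdefault(g, []).append(y)
    # One fitted quantile per (group, tau) pair: the toy "regression" step.
    fitted = {g: [fit_quantile(vals, t) for t in taus] for g, vals in by_group.items()}
    # For each record, draw a quantile level from the grid and release the fit.
    return [fitted[g][rng.randrange(len(taus))] for g in groups]


# Hypothetical heavy-tailed, heteroskedastic data: Pareto draws whose
# scale depends on the group label (assumed values, for illustration only).
rng = random.Random(0)
groups = [i % 2 for i in range(200)]
ys = [rng.paretovariate(1.5) * (1 + 4 * g) for g in groups]
taus = [k / 20 for k in range(1, 20)]
synthetic = synthesize(groups, ys, taus, rng)
```

Because this toy fit is intercept-only within each group, every synthetic value coincides with some observed value, which would not be acceptable for an actual release. With continuous covariates, a fitted regression quantile is generally not an observed value, and a finer grid of quantile levels would be needed to represent the tails.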
Notes
1. All code is available in our GitHub repository at https://github.com/labordynamicsinstitute/replication_qr_synthetic.
2. We are discussing applying the methodology to the confidential data for the purpose of generating a new release of synthetic data.
3. Note that neither geography nor the full SIC code was used in the quantile regression synthesis.
Acknowledgments
The Synthetic LBD data were accessed through the Synthetic Data Server at Cornell University, which is funded through NSF Grants SES-1042181 and BCS-0941226 and a grant from the Alfred P. Sloan Foundation. Access to the Synthetic LBD is described at https://www2.vrdc.cornell.edu/news/synthetic-data-server/step-1-requesting-access-to-sds/. Use of the Integrated Census Microdata Project at the University of Essex was facilitated by Gillian Raab.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Pistner, M., Slavković, A., Vilhuber, L. (2018). Synthetic Data via Quantile Regression for Heavy-Tailed and Heteroskedastic Data. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_7
DOI: https://doi.org/10.1007/978-3-319-99771-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99770-4
Online ISBN: 978-3-319-99771-1
eBook Packages: Computer Science, Computer Science (R0)