Synthetic Data via Quantile Regression for Heavy-Tailed and Heteroskedastic Data

Pistner, Michelle; Slavković, Aleksandra; Vilhuber, Lars

doi:10.1007/978-3-319-99771-1_7

Michelle Pistner, Aleksandra Slavković & Lars Vilhuber (ORCID: orcid.org/0000-0001-5733-8932)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11126)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

Privacy protection of confidential data is a fundamental problem faced by many government organizations and research centers. It is further complicated when data have complex structures or variables with highly skewed distributions. The statistical community addresses general privacy concerns by introducing different techniques that aim to decrease disclosure risk in released data while retaining their statistical properties. However, methods for complex data structures have received insufficient attention. We propose producing synthetic data via quantile regression to address privacy protection of heavy-tailed and heteroskedastic data. We address some shortcomings of the previously proposed use of quantile regression as a synthesis method and extend the work into cases where data have heavy tails or heteroskedastic errors. Using a simulation study and two applications, we show that there are settings where quantile regression performs as well as or better than other commonly used synthesis methods on the basis of maintaining good data utility while simultaneously decreasing disclosure risk.
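The general idea of synthesis via quantile regression can be illustrated with a minimal sketch. It is not the authors' implementation (their code is in the repository listed under Notes); it only assumes the basic recipe of fitting linear conditional quantiles on a grid of levels, drawing a random level per record, and reading off the fitted quantile. The function names `fit_quantile` and `synthesize`, the toy data, the quantile grid, and the iteratively reweighted least squares approximation to the quantile fit are all assumptions made for illustration.

```python
import numpy as np


def fit_quantile(X, y, tau, n_iter=50, eps=1e-6):
    """Approximate the tau-th linear regression quantile by iteratively
    reweighted least squares on the check (pinball) loss."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from the OLS fit
    for _ in range(n_iter):
        resid = y - X @ beta
        # Check-loss weights: tau above the fit, (1 - tau) below, over |resid|.
        w = np.where(resid >= 0, tau, 1.0 - tau) / np.maximum(np.abs(resid), eps)
        Xw = X * w[:, None]
        beta = np.linalg.lstsq(X.T @ Xw, Xw.T @ y, rcond=None)[0]
    return beta


def synthesize(X, y, taus, rng):
    """Draw one synthetic response per record from the fitted conditional quantiles."""
    betas = np.array([fit_quantile(X, y, t) for t in taus])  # shape (K, p)
    q_hat = np.sort(X @ betas.T, axis=1)  # (n, K); sorted so quantiles do not cross
    u = rng.uniform(taus[0], taus[-1], size=len(y))  # random quantile level per record
    # Interpolate each record's fitted quantile curve at its drawn level.
    return np.array([np.interp(ui, taus, row) for ui, row in zip(u, q_hat)])


rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
# Toy data (hypothetical): noise scale grows with x (heteroskedastic),
# and the errors are Student-t with 3 degrees of freedom (heavy tails).
y = 2.0 + 3.0 * x + x * rng.standard_t(df=3, size=n)
X = np.column_stack([np.ones(n), x])
taus = np.linspace(0.05, 0.95, 19)
y_syn = synthesize(X, y, taus, rng)
```

Because the quantile levels in this sketch are restricted to the grid between 0.05 and 0.95, the synthetic values do not reproduce the most extreme tails. How far to extend the grid is a trade-off between data utility and disclosure risk for outlying records.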

Notes

1. All code is available in our GitHub repository at https://github.com/labordynamicsinstitute/replication_qr_synthetic.
2. We are discussing applying the methodology to the confidential data for the purpose of generating a new release of synthetic data.
3. Note that neither geography nor the full SIC code was used in the quantile regression synthesis.

Acknowledgments

The Synthetic LBD data were accessed through the Synthetic Data Server at Cornell University, which is funded through NSF Grants SES-1042181 and BCS-0941226 and a grant from the Alfred P. Sloan Foundation. Access to the Synthetic LBD is described at https://www2.vrdc.cornell.edu/news/synthetic-data-server/step-1-requesting-access-to-sds/. Use of the Integrated Census Microdata Project at the University of Essex was facilitated by Gillian Raab.

Author information

Authors and Affiliations

Department of Statistics, The Pennsylvania State University, University Park, PA, USA
Michelle Pistner & Aleksandra Slavković
Economics Department, Cornell University, Ithaca, NY, USA
Lars Vilhuber

Corresponding author

Correspondence to Michelle Pistner.

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Spain
Josep Domingo-Ferrer
University of Valencia, Burjassot, Spain
Francisco Montes

A Supplementary Materials for 1901 Census of Scotland

There were a total of 82,851 observations in our extract. Of these observations, 20,303 were female and 62,548 were male. Additional statistics follow (Tables 4, 5 and Fig. 3).

Table 4. Summary statistics for continuous variables for extract of 1901 Census of Scotland. Note that the count variables had very heavy tails.

Table 5. Count of records by marital status.

Cite this paper

Pistner, M., Slavković, A., Vilhuber, L. (2018). Synthetic Data via Quantile Regression for Heavy-Tailed and Heteroskedastic Data. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_7

DOI: https://doi.org/10.1007/978-3-319-99771-1_7
Published: 25 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99770-4
Online ISBN: 978-3-319-99771-1
eBook Packages: Computer Science, Computer Science (R0)
