K-Anonymity: A Note on the Trade-Off between Data Utility and Data Security

23 Pages Posted: 6 Sep 2017

Tatiana Komarova

Department of Economics, University of Manchester

Denis Nekipelov

University of Virginia

Ahnaf Rafi

Northwestern University; London School of Economics & Political Science (LSE) - Department of Economics

Evgeny Yakovlev

New Economic School; SciencesPo - Sciences Po - Department of Economics; IZA

Date Written: August 24, 2017

Abstract

Researchers often combine data from multiple datasets to conduct credible econometric and statistical analysis. The most reliable way to link entries across such datasets is to exploit unique identifiers, if those are available. Such linkage, however, may result in privacy violations that reveal sensitive information about some individuals in a sample. Thus, a data curator concerned about individual privacy may choose to remove certain individual information from the private dataset they plan to release to researchers. The individual information the data curator keeps in the private dataset can still allow a researcher to link the datasets, most likely with some errors, and usually leaves the researcher with several feasible combined datasets. One conceptual framework a data curator may rely on is k-anonymity, with k >= 2, which has gained wide popularity in the computer science and statistics communities. To ensure k-anonymity, the data curator releases only as much identifying information in the private dataset as guarantees that every entry in it can be linked to at least k different entries in the publicly available datasets the researcher will use. In this paper, we look at the data combination task and the estimation task from both perspectives: that of the researcher estimating the model and that of a data curator who restricts identifying information in the private dataset to make sure that k-anonymity holds. We illustrate how to construct identifiers in practice and use them to combine some entries across two datasets. We also provide an empirical illustration of how a data curator can ensure k-anonymity and of the consequences this has for the estimation procedure. Naturally, the utility of the combined data decreases as k increases, which is also evident from our empirical illustration.
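The linkage condition described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure; it only assumes that linkage means exact agreement on the released identifying attributes. The function names, the attribute names (`zip`, `birth_year`), and the toy records are all hypothetical.

```python
from collections import Counter


def link_counts(private_rows, public_rows, released_keys):
    """For each private entry, count the public entries that agree with it
    on every released identifying attribute (exact matching is an assumption)."""
    def signature(row):
        return tuple(row[key] for key in released_keys)

    public_counts = Counter(signature(row) for row in public_rows)
    return [public_counts[signature(row)] for row in private_rows]


def satisfies_k_anonymity(private_rows, public_rows, released_keys, k):
    """True if every private entry can be linked to at least k public entries."""
    counts = link_counts(private_rows, public_rows, released_keys)
    return all(count >= k for count in counts)


# Hypothetical toy data: a public dataset and a private dataset to be released.
public_rows = [
    {"zip": "60208", "birth_year": 1980},
    {"zip": "60208", "birth_year": 1981},
    {"zip": "60208", "birth_year": 1980},
    {"zip": "22903", "birth_year": 1975},
    {"zip": "22903", "birth_year": 1990},
]
private_rows = [
    {"zip": "60208", "birth_year": 1980, "outcome": 1},
    {"zip": "22903", "birth_year": 1975, "outcome": 0},
]

# Releasing both attributes lets the second private entry be linked uniquely,
# so 2-anonymity fails; releasing only the zip code restores it.
fine = link_counts(private_rows, public_rows, ["zip", "birth_year"])
coarse = link_counts(private_rows, public_rows, ["zip"])
print(fine, coarse)
```

In this toy case, dropping `birth_year` makes every private entry compatible with at least two public entries, but the researcher can no longer tell which of those candidates is the true match. That is the utility loss the abstract refers to as k increases.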

Keywords: Data protection, data combination, k-anonymity

JEL Classification: C35, C14, C25, C13

Suggested Citation

Komarova, Tatiana and Nekipelov, Denis and Rafi, Ahnaf and Yakovlev, Evgeny, K-Anonymity: A Note on the Trade-Off between Data Utility and Data Security (August 24, 2017). Available at SSRN: https://ssrn.com/abstract=3030386 or http://dx.doi.org/10.2139/ssrn.3030386

Tatiana Komarova (Contact Author)

Department of Economics, University of Manchester

Arthur Lewis Building
Oxford Road
Manchester, M13 9PL
United Kingdom

Denis Nekipelov

University of Virginia

1400 University Ave
Charlottesville, VA 22903
United States

Ahnaf Rafi

Northwestern University

2001 Sheridan Road
Evanston, IL 60208
United States

London School of Economics & Political Science (LSE) - Department of Economics

Houghton Street
London WC2A 2AE
United Kingdom

Evgeny Yakovlev

New Economic School

Skolkovskoe shosse 45
Moscow, 121343
Russia

SciencesPo - Sciences Po - Department of Economics

28, rue des Saints-Pères
Paris, Paris 75007
France

IZA

P.O. Box 7240
Bonn, D-53072
Germany
