Probabilistic Anonymity

Lodha, Sachin; Thomas, Dilys

doi:10.1007/978-3-540-78478-4_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4890)

Included in the following conference series:

International Workshop on Privacy, Security, and Trust in KDD

Abstract

In this age of globalization, organizations need to publish their micro-data owing to legal directives or share it with business associates in order to remain competitive. This puts personal privacy at risk. To surmount this risk, attributes that clearly identify individuals, such as Name, Social Security Number, and Driving License Number, are generally removed or replaced by random values. But this may not be enough, because such de-identified databases can sometimes be joined with other public databases on attributes such as Gender, Date of Birth, and Zipcode to re-identify individuals who were supposed to remain anonymous. In the literature, such an identity-leaking attribute combination is called a quasi-identifier. It is always critical to be able to recognize quasi-identifiers and to apply appropriate protective measures to them to mitigate the identity disclosure risk posed by join attacks.
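The join attack described above can be illustrated with a minimal Python sketch. The records, column names, and the `join_attack` helper are all hypothetical and chosen only for illustration; they are not taken from the paper. The sketch links a de-identified table to a public one on Gender, Date of Birth, and Zipcode, and reports every person whose combination matches exactly one de-identified row.

```python
# Hypothetical toy data; none of these records come from the paper.
deidentified = [
    {"gender": "F", "dob": "1970-03-12", "zipcode": "94305", "diagnosis": "asthma"},
    {"gender": "M", "dob": "1982-11-02", "zipcode": "94043", "diagnosis": "diabetes"},
    {"gender": "F", "dob": "1970-03-12", "zipcode": "94043", "diagnosis": "flu"},
]
public = [
    {"name": "Alice Example", "gender": "F", "dob": "1970-03-12", "zipcode": "94305"},
    {"name": "Bob Example", "gender": "M", "dob": "1982-11-02", "zipcode": "94043"},
]


def join_attack(private_rows, public_rows, columns):
    """Return (name, private row) pairs where the public record's values on
    `columns` match exactly one row of the de-identified table."""
    # Index the de-identified rows by their value combination on `columns`.
    index = {}
    for row in private_rows:
        key = tuple(row[c] for c in columns)
        index.setdefault(key, []).append(row)

    hits = []
    for person in public_rows:
        key = tuple(person[c] for c in columns)
        matches = index.get(key, [])
        if len(matches) == 1:  # a unique match re-identifies the person
            hits.append((person["name"], matches[0]))
    return hits


reidentified = join_attack(deidentified, public, ("gender", "dob", "zipcode"))
for name, row in reidentified:
    print(name, "->", row["diagnosis"])
```

On this toy data the three-column combination singles out both public records, whereas joining on `gender` alone leaves the first person ambiguous between two rows.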

In this paper, we start out by providing the first formal characterization and a practical technique to identify quasi-identifiers. We show an interesting connection between whether a set of columns forms a quasi-identifier and the number of distinct values assumed by the combination of the columns. We then use this characterization to come up with a probabilistic notion of anonymity. Again we show an interesting connection between the number of distinct values taken by a combination of columns and the anonymity it can offer. This allows us to find an ideal amount of generalization or suppression to apply to different columns in order to achieve probabilistic anonymity. We work through many examples and show that our analysis can be used to make a published database conform to privacy rules like HIPAA. In order to achieve probabilistic anonymity, we observe that one needs to solve multiple 1-dimensional k-anonymity problems. We propose many efficient and scalable algorithms for achieving 1-dimensional anonymity. Our algorithms are optimal in the sense that they minimally distort data and retain much of its utility.
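Two of the ideas above lend themselves to a rough Python sketch. The abstract does not state the paper's exact characterization or its algorithms, so both functions below are assumptions made for illustration: `distinct_fraction` simply reports how many distinct value combinations a set of columns takes relative to the table size, and `one_dim_k_anonymize` is a plain sort-and-group baseline for 1-dimensional k-anonymity, not one of the authors' optimal algorithms.

```python
from collections import Counter


def distinct_fraction(rows, columns):
    """Number of distinct value combinations on `columns`, divided by the
    number of rows. Values near 1 mean most rows are unique on `columns`."""
    keys = {tuple(row[c] for c in columns) for row in rows}
    return len(keys) / len(rows)


def one_dim_k_anonymize(values, k):
    """Generalize a single numeric column so that every published interval
    is shared by at least k records.

    Baseline only (an assumption, not the paper's method): sort the values,
    cut them into consecutive groups of size k, fold a short tail into the
    last group, and replace each value by its group's (min, max) interval.
    """
    if k < 1 or len(values) < k:
        raise ValueError("need at least k values and k >= 1")
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    out = [None] * n
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # fewer than k values would remain: absorb them
            end = n
        group = order[start:end]
        interval = (values[group[0]], values[group[-1]])
        for i in group:
            out[i] = interval
        start = end
    return out


# Hypothetical rows and ages, for illustration only.
rows = [
    {"gender": "F", "zipcode": "94305"},
    {"gender": "M", "zipcode": "94043"},
    {"gender": "F", "zipcode": "94043"},
    {"gender": "M", "zipcode": "94305"},
]
ages = [34, 21, 29, 45, 23, 38, 52]
generalized = one_dim_k_anonymize(ages, k=3)
group_sizes = Counter(generalized)
```

Here `distinct_fraction(rows, ("gender",))` is 0.5 while the two-column combination gives 1.0, and every interval in `generalized` covers at least three of the seven ages.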

Author information

Authors and Affiliations

Tata Research Development and Design Centre, Pune, India
Sachin Lodha
Google Inc., Mountain View, USA
Dilys Thomas

Editor information

Francesco Bonchi, Elena Ferrari, Bradley Malin, Yücel Saygin

Cite this paper

Lodha, S., Thomas, D. (2008). Probabilistic Anonymity. In: Bonchi, F., Ferrari, E., Malin, B., Saygin, Y. (eds) Privacy, Security, and Trust in KDD. PInKDD 2007. Lecture Notes in Computer Science, vol 4890. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78478-4_4

DOI: https://doi.org/10.1007/978-3-540-78478-4_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78477-7
Online ISBN: 978-3-540-78478-4
eBook Packages: Computer Science (R0)
