Probabilistic Anonymity

  • Conference paper
Privacy, Security, and Trust in KDD (PInKDD 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4890)

Abstract

In this age of globalization, organizations need to publish their micro-data owing to legal directives or share it with business associates in order to remain competitive. This puts personal privacy at risk. To surmount this risk, attributes that clearly identify individuals, such as Name, Social Security Number, and Driving License Number, are generally removed or replaced by random values. But this may not be enough, because such de-identified databases can sometimes be joined with other public databases on attributes such as Gender, Date of Birth, and Zipcode to re-identify individuals who were supposed to remain anonymous. In the literature, such an identity-leaking attribute combination is called a quasi-identifier. It is always critical to be able to recognize quasi-identifiers and to apply appropriate protective measures to them to mitigate the identity disclosure risk posed by join attacks.
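The join attack described above can be illustrated with a minimal sketch. The function name `join_attack`, the column names, and the toy records are all hypothetical and are not taken from the paper; the sketch only shows how a unique match on a quasi-identifier links a de-identified record back to a name.

```python
def join_attack(deidentified, public, quasi_identifier):
    """Link de-identified records to named public records that agree on
    every quasi-identifier column. Only unique matches are reported."""
    # Index the public table by its quasi-identifier values.
    index = {}
    for person in public:
        key = tuple(person[c] for c in quasi_identifier)
        index.setdefault(key, []).append(person)

    matches = []
    for record in deidentified:
        key = tuple(record[c] for c in quasi_identifier)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # exactly one candidate: re-identified
            matches.append((candidates[0]["name"], record))
    return matches


# Hypothetical toy data: a public list with names, and a de-identified
# table whose Name column has already been removed.
public = [
    {"name": "Alice", "gender": "F", "dob": "1970-01-01", "zip": "94305"},
    {"name": "Bob", "gender": "M", "dob": "1982-05-17", "zip": "10001"},
]
deidentified = [
    {"gender": "F", "dob": "1970-01-01", "zip": "94305", "diagnosis": "asthma"},
    {"gender": "M", "dob": "1990-03-03", "zip": "60601", "diagnosis": "flu"},
]

matches = join_attack(deidentified, public, ["gender", "dob", "zip"])
```

In this toy data only the first de-identified record has a unique counterpart in the public list, so only that record is linked to a name.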

In this paper, we start out by providing the first formal characterization and a practical technique to identify quasi-identifiers. We show an interesting connection between whether a set of columns forms a quasi-identifier and the number of distinct values assumed by the combination of the columns. We then use this characterization to come up with a probabilistic notion of anonymity. Again we show an interesting connection between the number of distinct values taken by a combination of columns and the anonymity it can offer. This allows us to find an ideal amount of generalization or suppression to apply to different columns in order to achieve probabilistic anonymity. We work through many examples and show that our analysis can be used to make a published database conform to privacy rules like HIPAA. In order to achieve probabilistic anonymity, we observe that one needs to solve multiple 1-dimensional k-anonymity problems. We propose many efficient and scalable algorithms for achieving 1-dimensional anonymity. Our algorithms are optimal in the sense that they minimally distort data and retain much of its utility.
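The two ingredients named in the abstract, counting the distinct values of a column combination and anonymizing a single column, can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the abstract gives neither the formal characterization nor the algorithms, so `distinct_ratio`, `unique_fraction`, and the greedy `anonymize_1d` are hypothetical helpers, and the greedy grouping is not claimed to be optimal.

```python
from collections import Counter


def distinct_ratio(rows, columns):
    """Number of distinct value combinations divided by the number of rows."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(set(keys)) / len(keys)


def unique_fraction(rows, columns):
    """Fraction of rows whose value combination occurs exactly once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return sum(1 for n in counts.values() if n == 1) / len(rows)


def anonymize_1d(values, k):
    """Greedy 1-dimensional k-anonymity (assumed heuristic): sort the values,
    cut them into consecutive groups of at least k, and replace each value
    by the (min, max) range of its group."""
    n = len(values)
    if k < 1 or n < k:
        raise ValueError("need at least k values and k >= 1")
    order = sorted(range(n), key=lambda i: values[i])
    generalized = [None] * n
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # too few values left to form another group
            end = n
        group = order[start:end]
        low, high = values[group[0]], values[group[-1]]
        for i in group:
            generalized[i] = (low, high)
        start = end
    return generalized


# Hypothetical toy table.
rows = [
    {"gender": "F", "zip": "94305"},
    {"gender": "M", "zip": "94305"},
    {"gender": "F", "zip": "94305"},
    {"gender": "M", "zip": "10001"},
]
ratio_gender = distinct_ratio(rows, ["gender"])
ratio_both = distinct_ratio(rows, ["gender", "zip"])
unique_both = unique_fraction(rows, ["gender", "zip"])

ages = [23, 25, 31, 35, 40, 41, 58]
ranges = anonymize_1d(ages, k=3)
```

A column combination whose distinct ratio approaches 1 singles out most rows and is therefore a candidate quasi-identifier. After `anonymize_1d`, every published range is shared by at least `k` records, at the cost of coarser values.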


Editor information

Francesco Bonchi Elena Ferrari Bradley Malin Yücel Saygin

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lodha, S., Thomas, D. (2008). Probabilistic Anonymity. In: Bonchi, F., Ferrari, E., Malin, B., Saygin, Y. (eds) Privacy, Security, and Trust in KDD. PInKDD 2007. Lecture Notes in Computer Science, vol 4890. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78478-4_4

  • DOI: https://doi.org/10.1007/978-3-540-78478-4_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78477-7

  • Online ISBN: 978-3-540-78478-4

  • eBook Packages: Computer Science, Computer Science (R0)
