How to Accurately and Privately Identify Anomalies

ABSTRACT
Identifying anomalies in data is central to the advancement of science, national security, and finance. However, privacy concerns restrict our ability to analyze data. Can we lift these restrictions and accurately identify anomalies without hurting the privacy of those who contribute their data? We address this question for the most practically relevant case, where a record is considered anomalous relative to the other records. We make four contributions. First, we introduce the notion of sensitive privacy, which conceptualizes what it means to privately identify anomalies. Sensitive privacy generalizes the important concept of differential privacy and is amenable to analysis. Importantly, sensitive privacy admits algorithmic constructions that provide strong and practically meaningful privacy and utility guarantees. Second, we show that differential privacy is inherently incapable of accurately and privately identifying anomalies; in this sense, our generalization is necessary. Third, we provide a general compiler that takes as input a differentially private mechanism (which has poor utility for anomaly identification) and transforms it into a sensitively private one. This compiler, which is mostly of theoretical importance, is shown to output a mechanism whose utility greatly improves over that of the input mechanism. As our fourth contribution, we propose mechanisms for a popular definition of anomaly (the (β, r)-anomaly) that (i) are guaranteed to be sensitively private, (ii) come with provable utility guarantees, and (iii) are empirically shown to perform with high accuracy over a range of datasets and evaluation criteria.
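The (β, r)-anomaly in the fourth contribution is the classic distance-based outlier: a record is anomalous if fewer than β other records lie within distance r of it. The sketch below illustrates this test, together with a naive Laplace-noised neighbor count of the kind of plain differentially private baseline whose utility the abstract argues is poor; the function names and the NumPy representation of records are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def is_beta_r_anomaly(x, data, beta, r):
    """A record x is a (beta, r)-anomaly if fewer than beta *other*
    records of the dataset lie within distance r of it."""
    dists = np.linalg.norm(data - x, axis=1)
    neighbors = int(np.sum(dists <= r)) - 1  # subtract x's match with itself
    return neighbors < beta

def noisy_neighbor_count(x, data, r, epsilon, rng=None):
    """Naive epsilon-differentially-private neighbor count via the
    Laplace mechanism (a counting query has sensitivity 1)."""
    rng = rng or np.random.default_rng()
    dists = np.linalg.norm(data - x, axis=1)
    true_count = int(np.sum(dists <= r)) - 1
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Three clustered records and one isolated record.
data = np.array([[0.0], [0.1], [0.2], [5.0]])
print(is_beta_r_anomaly(data[3], data, beta=2, r=1.0))  # → True
print(is_beta_r_anomaly(data[0], data, beta=2, r=1.0))  # → False
```

Thresholding the noisy count at β yields a differentially private identifier; the abstract's second contribution argues that the noise required to hide any single record also obscures precisely the rare records one wants to flag, which is the failure mode sensitive privacy is designed to avoid.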