Abstract
The emergence of cloud-based storage services is opening up new avenues in data exchange and data dissemination. This has amplified the interest in right-protection mechanisms to establish ownership in the event of data leakage. Current right-protection technologies, however, rarely provide strong guarantees on dataset utility after the protection process. This work presents techniques that explicitly address this topic and provably preserve the outcome of certain mining operations. In particular, we take special care to guarantee that the outcome of hierarchical clustering operations remains the same before and after right protection. Our approach considers all prevalent hierarchical clustering variants: single-, complete-, and average-linkage. We imprint the ownership in a dataset using watermarking principles, and we derive tight bounds on the expansion/contraction of distances incurred by the process. We leverage our analysis to design fast algorithms for right protection without exhaustively searching the vast design space. Finally, because the right-protection process introduces a user-tunable distortion on the dataset, we explore the possibility of using this mechanism for data obfuscation. We quantify the tradeoff between obfuscation and utility for spatiotemporal datasets and discover very favorable characteristics of the process. An additional advantage is that when one is interested in both right-protecting and obfuscating the original data values, the proposed mechanism can accomplish both tasks simultaneously.
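The preservation property described above can be illustrated with a minimal sketch. It is not the authors' embedding scheme or their fast algorithm: the `embed` function is a hypothetical stand-in that scales each coordinate by `1 + power * w`, with `w` taken from a secret ±1 key, and the toy `points`, `key`, and `power` values are assumptions chosen for illustration. The sketch only checks, by brute force, whether the single-linkage merge sequence is identical before and after the perturbation.

```python
import itertools
import math


def single_linkage_merges(points):
    """Naive agglomerative clustering; returns the merges in the order they occur."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Single linkage: distance between clusters is the smallest
        # distance between any pair of their member points.
        a, b = min(
            itertools.combinations(clusters, 2),
            key=lambda pair: min(
                math.dist(points[i], points[j])
                for i in pair[0]
                for j in pair[1]
            ),
        )
        merges.append(frozenset([a, b]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


def embed(points, key, power):
    """Hypothetical stand-in watermark (an assumption, not the paper's scheme):
    scale coordinate d of every point by (1 + power * key[d])."""
    return [tuple(x * (1 + power * w) for x, w in zip(p, key)) for p in points]


# Toy data and key, chosen only for illustration.
points = [(1.0, 1.0), (2.0, 1.0), (6.0, 6.0), (6.0, 8.0), (20.0, 2.0)]
key = (+1, -1)

original = single_linkage_merges(points)
preserved_low = single_linkage_merges(embed(points, key, 0.01)) == original
preserved_high = single_linkage_merges(embed(points, key, 0.5)) == original
print("power=0.01 preserves merge order:", preserved_low)
print("power=0.5 preserves merge order:", preserved_high)
```

On this toy input, a small `power` leaves the merge order intact while a large one swaps the first two merges. Re-clustering for every candidate power, as done here, is the exhaustive search that the abstract says the derived distance bounds are meant to avoid.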
Index Terms
- On Data Publishing with Clustering Preservation