research-article

On Data Publishing with Clustering Preservation

Published: 01 April 2015

Abstract

The emergence of cloud-based storage services is opening up new avenues in data exchange and data dissemination. This has amplified the interest in right-protection mechanisms to establish ownership in the event of data leakage. Current right-protection technologies, however, rarely provide strong guarantees on dataset utility after the protection process. This work presents techniques that explicitly address this topic and provably preserve the outcome of certain mining operations. In particular, we take special care to guarantee that the outcome of hierarchical clustering operations remains the same before and after right protection. Our approach considers all prevalent hierarchical clustering variants: single-, complete-, and average-linkage. We imprint the ownership in a dataset using watermarking principles, and we derive tight bounds on the expansion/contraction of distances incurred by the process. We leverage our analysis to design fast algorithms for right protection without exhaustively searching the vast design space. Finally, because the right-protection process introduces a user-tunable distortion on the dataset, we explore the possibility of using this mechanism for data obfuscation. We quantify the tradeoff between obfuscation and utility for spatiotemporal datasets and discover very favorable characteristics of the process. An additional advantage is that when one is interested in both right-protecting and obfuscating the original data values, the proposed mechanism can accomplish both tasks simultaneously.
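The idea of bounded distance distortion with a preserved dendrogram can be illustrated with a minimal sketch. It is not the authors' algorithm: the multiplicative embedding, the function names `embed` and `single_linkage_merges`, the toy dataset, the key, and the embedding power of 0.05 are all assumptions chosen for illustration. The sketch also only verifies preservation for one fixed power by recomputing the clustering, which is the kind of brute-force check the abstract says the proposed algorithms avoid.

```python
import itertools
import math
import random


def embed(series, key, power):
    """Scale coordinate i of every object by (1 + power * w_i), w_i = +/-1 from a keyed PRNG."""
    rng = random.Random(key)
    marks = [rng.choice((-1, 1)) for _ in series[0]]
    return [[x * (1 + power * w) for x, w in zip(row, marks)] for row in series]


def single_linkage_merges(series):
    """Return the merge sequence of naive single-linkage agglomerative clustering."""
    clusters = [frozenset([i]) for i in range(len(series))]
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance is the smallest pairwise object distance.
        a, b = min(
            itertools.combinations(clusters, 2),
            key=lambda pair: min(
                math.dist(series[i], series[j]) for i in pair[0] for j in pair[1]
            ),
        )
        merges.append(frozenset([a, b]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical toy dataset: five objects, three coordinates each.
data = [
    [0.0, 0.0, 0.0],
    [1.0, 0.5, 0.0],
    [5.0, 5.0, 5.0],
    [6.5, 5.0, 5.5],
    [20.0, 0.0, 3.0],
]
power = 0.05
marked = embed(data, key=42, power=power)

# Each pairwise distance expands or contracts by at most a factor of (1 +/- power).
ratios = [
    math.dist(marked[i], marked[j]) / math.dist(data[i], data[j])
    for i, j in itertools.combinations(range(len(data)), 2)
]
bounded = all(1 - power - 1e-12 <= r <= 1 + power + 1e-12 for r in ratios)

# The dendrogram is preserved when both versions merge the same clusters in the same order.
preserved = single_linkage_merges(marked) == single_linkage_merges(data)
print(bounded, preserved)
```

In this sketch every squared coordinate difference is scaled by a factor between (1 - power)^2 and (1 + power)^2, so each pairwise Euclidean distance stays within a factor of (1 ± power) of its original value. The merge order survives whenever the gaps between competing linkage distances exceed that distortion, which is the case for this well-separated toy data.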

Published in

ACM Transactions on Knowledge Discovery from Data, Volume 9, Issue 3
TKDD Special Issue (SIGKDD'13)
April 2015
313 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/2737800

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 April 2015
• Accepted: 1 September 2014
• Revised: 1 April 2014
• Received: 1 November 2013

Qualifiers

• research-article
• Research
• Refereed