Abstract
k-means clustering is one of the most popular clustering algorithms in data mining. Recently, much research has concentrated on the algorithm when the data set is divided among multiple parties or when the data set is too large to be handled by the data owner. In the latter case, servers are usually hired to perform the clustering. The data owner divides the data set among the servers, which together compute the k-means and return the cluster labels to the owner. The major challenge in this method is to prevent the servers from gaining substantial information about the owner's actual data. Several algorithms have been designed in the past that provide cryptographic solutions for privacy-preserving k-means. We propose a new method to perform k-means over a large data set using multiple servers. Our technique avoids heavy cryptographic computations and instead uses a simple randomization technique to preserve the privacy of the data. The k-means computed has essentially the same efficiency and accuracy as the k-means computed over the original data set without any randomization. We argue that our algorithm is secure against honest-but-curious, non-colluding adversaries.
Notes
1. We are aware of several other methods for selecting the initial centers that may make k-means work more efficiently. However, this work does not focus on the assignment of initial clusters.
Appendices
A Detailed Computations
A.1 Lower Bound for \(r_i\), \(1 \le i \le 2d\)
To maximize the LHS and minimize the RHS, we take \(\epsilon_1 = \epsilon\), \(\epsilon_2 = 0\), \(\epsilon_3 = \epsilon\), and thus,
Using (10), we further get,
Given that \(\epsilon\) is sufficiently small, it can safely be assumed to be less than 1. Hence, if (12) is satisfied, (20) is satisfied as well. Although (20) is a better bound, we use (12) because it is independent of \(\epsilon\).
A.2 Kullback-Leibler Distance
The Kullback-Leibler Distance (KD) is defined to be \(-\sum_i P(i)\log \frac{Q(i)}{P(i)}\) where
Without loss of generality, we assume that the above expression attains its minimum for \(i=1\),
Let,
Hence,
Finally,
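As a concrete illustration of the definition at the start of this appendix, the following minimal sketch evaluates \(-\sum_i P(i)\log \frac{Q(i)}{P(i)}\) for two discrete distributions given as lists of probabilities. The function name `kl_divergence`, the use of the natural logarithm, and the conventions for zero probabilities are assumptions made for this example; they are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """Evaluate -sum_i P(i) * log(Q(i) / P(i)) for discrete distributions.

    Assumptions for this sketch (not from the paper): natural logarithm,
    terms with P(i) == 0 contribute 0, and the result is infinite when
    Q(i) == 0 while P(i) > 0.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # 0 * log(...) is taken to be 0
        if q_i == 0:
            return math.inf  # Q assigns no mass where P does
        total -= p_i * math.log(q_i / p_i)
    return total


if __name__ == "__main__":
    uniform = [0.5, 0.5]
    skewed = [0.9, 0.1]
    print(kl_divergence(uniform, uniform))
    print(kl_divergence(skewed, uniform))
```

The quantity is zero when the two distributions coincide and is not symmetric in P and Q, which is why the order of the arguments matters.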
B Range of Bit Length of the Parameters
The probability of correctly guessing the random numbers from Eq. (1) is computed as follows. The adversary may arbitrarily fix the choice of two indices from \(\{1,\ldots ,2d\}\) for the \(r_i\)s and the corresponding index from \(\{1,\ldots ,n\}\) for the choice of \(\epsilon\). Fixing two \(r_i\) out of the 2d available can be done in \(\binom{2d}{2}\) ways. Similarly, choosing one \(\epsilon_i\) out of the n available can be done in n ways. Hence the probability is:
Similarly, the probability of correctly guessing from Eq. (13) is:
Fixing n and d as chosen, for the probability to be less than \(2^{-80}\), the following two equations must be satisfied,
and
Hence the above two equations give us the range for the bit length of the parameters.
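The counting step in this appendix can be sketched in a few lines. The example below only computes the number of index choices, \(\binom{2d}{2} \cdot n\), and checks a given guess probability against the \(2^{-80}\) target. It does not reproduce the paper's probability expressions, which depend on the bit lengths of the parameters. The function names and the sample values of n and d are hypothetical.

```python
from fractions import Fraction
from math import comb, log2

SECURITY_BITS = 80  # target from the text: guess probability below 2^-80


def index_choices(n: int, d: int) -> int:
    """Ways to fix two r_i out of 2d and one epsilon_i out of n."""
    return comb(2 * d, 2) * n


def below_target(p: Fraction, bits: int = SECURITY_BITS) -> bool:
    """Return True if the guess probability p is strictly below 2^-bits."""
    return p < Fraction(1, 2 ** bits)


if __name__ == "__main__":
    n, d = 1000, 4  # hypothetical data-set size and dimension
    ways = index_choices(n, d)
    # log2(ways) expresses the index-choice count in bits.
    print(ways, round(log2(ways), 2))
    print(below_target(Fraction(1, 2 ** 81)))
```

Exact rational arithmetic via `Fraction` is used so that comparisons against \(2^{-80}\) are not affected by floating-point rounding.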
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ghosal, R., Chatterjee, S. (2018). Privacy Preserving Multi-server k-means Computation over Horizontally Partitioned Data. In: Ganapathy, V., Jaeger, T., Shyamasundar, R. (eds) Information Systems Security. ICISS 2018. Lecture Notes in Computer Science, vol 11281. Springer, Cham. https://doi.org/10.1007/978-3-030-05171-6_10
DOI: https://doi.org/10.1007/978-3-030-05171-6_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05170-9
Online ISBN: 978-3-030-05171-6
eBook Packages: Computer Science (R0)