Privacy Preserving Multi-server k-means Computation over Horizontally Partitioned Data

  • Conference paper
Information Systems Security (ICISS 2018)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 11281)

Abstract

k-means is one of the most popular clustering algorithms in data mining. Much recent research has focused on running the algorithm when the data set is divided among multiple parties or is too large for the data owner to handle. In the latter case, the owner typically hires servers to perform the clustering: the data set is divided among the servers, which together compute the k-means and return the cluster labels to the owner. The major challenge is to prevent the servers from gaining substantial information about the owner's actual data. Several earlier algorithms provide cryptographic solutions for privacy-preserving k-means. We propose a new method to perform k-means over a large data set using multiple servers. Our technique avoids heavy cryptographic computation and instead uses a simple randomization technique to preserve the privacy of the data. The k-means computed has essentially the same efficiency and accuracy as the k-means computed over the original data set without any randomization. We argue that our algorithm is secure against honest-but-curious, non-colluding adversaries.

Notes

  1. We are aware of several other methods for selecting the initial centers that may make k-means more efficient, but this work does not focus on the choice of initial clusters.

Author information

Corresponding authors

Correspondence to Riddhi Ghosal or Sanjit Chatterjee.

Appendices

A Detailed Computations

A.1 Lower Bound for \(r_i\), \(1 \le i \le 2d\)

$$\begin{aligned}&\max \sqrt{\sum _{i=1}^d\left( r_{2i-1}^2(x_{1i}-x_{2i})^2+(\epsilon _1x_{1i}-\epsilon _2x_{2i})^2+2r_{2i-1}(x_{1i}-x_{2i})(\epsilon _1x_{1i}-\epsilon _2x_{2i})\right) } \\< {}&\min \sqrt{\sum _{i=1}^d\left( r_{2i-1}^2(x_{1i}-x_{3i})^2+(\epsilon _1x_{1i}-\epsilon _3x_{3i})^2+2r_{2i-1}(x_{1i}-x_{3i})(\epsilon _1x_{1i}-\epsilon _3x_{3i})\right) } \end{aligned}$$

To maximize the LHS and minimize the RHS, we take \(\epsilon _1=\epsilon \), \(\epsilon _2=0\), \(\epsilon _3=\epsilon \). Squaring both sides then gives

$$\begin{aligned}&\sum _{i=1}^d\left( r_{2i-1}^2(x_{1i}-x_{2i})^2+\epsilon ^2x_{1i}^2+2r_{2i-1}\epsilon x_{1i}(x_{1i}-x_{2i})\right) \\< {}&\sum _{i=1}^d\left( r_{2i-1}^2(x_{1i}-x_{3i})^2+\epsilon ^2(x_{1i}-x_{3i})^2+2r_{2i-1} \epsilon (x_{1i}-x_{3i})^2\right) \end{aligned}$$

Using (10), we further get,

$$\begin{aligned}&\epsilon \sum _{i=1}^d{\left[ x_{1i}^2-(x_{1i}-x_{3i})^2\right] }+2 \sum _{i=1}^d{r_{2i-1}\left[ x_{1i}(x_{1i}-x_{2i})-(x_{1i}-x_{3i})^2\right] }<0 \\ \Rightarrow {}&2 \sum _{i=1}^d{r_{2i-1}\left[ (x_{1i}-x_{3i})^2-x_{1i}(x_{1i}-x_{2i})\right] }>\epsilon \sum _{i=1}^d{\left[ x_{1i}^2-(x_{1i}-x_{3i})^2\right] } \\ \Rightarrow {}&r> \max \left( \epsilon \frac{\sum _{l=1}^d{\left( x_{il}^2-(x_{il}-x_{kl})^2\right) }}{2 \sum _{l=1}^d{\left( (x_{il}-x_{kl})^2-x_{il}(x_{il}-x_{jl})\right) }}\right) , \quad \forall i, j, k. \end{aligned}$$
(20)

Since \(\epsilon \) is sufficiently small, it can safely be assumed to be less than 1. Hence, if (12) is satisfied, (20) is satisfied as well. Although (20) is a tighter bound, we use (12) because it is independent of \(\epsilon \).
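The first inequality of this appendix can be checked numerically on toy data. The sketch below is illustrative only: the points `x1`, `x2`, `x3`, the multipliers `r_odd` (standing in for \(r_{2i-1}\)) and the value of `eps` are hypothetical, and the per-coordinate difference \(r_{2i-1}(x_{ai}-x_{bi})+\epsilon _ax_{ai}-\epsilon _bx_{bi}\) is read off the expressions above rather than from the full protocol description. It confirms that the expanded form matches the direct square and that, for this data, the perturbed distances keep their order under the choice \(\epsilon _1=\epsilon \), \(\epsilon _2=0\), \(\epsilon _3=\epsilon \).

```python
import math

def perturbed_sq_dist(r_odd, xa, xb, eps_a, eps_b):
    """Squared distance between two perturbed points, one term per coordinate."""
    return sum(
        (r * (a - b) + eps_a * a - eps_b * b) ** 2
        for r, a, b in zip(r_odd, xa, xb)
    )

def expanded_sq_dist(r_odd, xa, xb, eps_a, eps_b):
    """Same quantity written as the three-term expansion used in A.1."""
    total = 0.0
    for r, a, b in zip(r_odd, xa, xb):
        diff = a - b
        noise = eps_a * a - eps_b * b
        total += r ** 2 * diff ** 2 + noise ** 2 + 2 * r * diff * noise
    return total

# Hypothetical toy data: x2 is closer to x1 than x3 is.
x1, x2, x3 = (1.0, 1.0), (2.0, 1.0), (5.0, 5.0)
r_odd = (10.0, 10.0)   # assumed values for r_1, r_3
eps = 0.01             # assumed upper bound on the epsilon values

# Choice used in the appendix: eps_1 = eps, eps_2 = 0, eps_3 = eps.
lhs = perturbed_sq_dist(r_odd, x1, x2, eps, 0.0)
rhs = perturbed_sq_dist(r_odd, x1, x3, eps, eps)
order_preserved = lhs < rhs
```

With these values the closer pair stays closer after perturbation; smaller multipliers or larger `eps` can be substituted to see where the ordering starts to fail.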

A.2 Kullback Leibler Distance

The Kullback Leibler Distance (KD) is defined to be \(-\sum _i{P(i)\log {\frac{Q(i)}{P(i)}}}\) where

$$P(i)=\frac{x_{1i}-x_{2i}}{x_{4i}-x_{3i}} \quad \text{and} \quad Q(i)=\frac{r_{2i-1}(x_{1i}-x_{2i})+\epsilon _1x_{1i}-\epsilon _2x_{2i}}{r_{2i-1}(x_{4i}-x_{3i})+\epsilon _4x_{4i}-\epsilon _3x_{3i}}.$$

$$\begin{aligned} KD= {}&-\sum _i{\frac{x_{1i}-x_{2i}}{x_{4i}-x_{3i}}\log {\frac{r_{2i-1}+\frac{\epsilon _1x_{1i}-\epsilon _2x_{2i}}{x_{1i}-x_{2i}}}{r_{2i-1}+\frac{\epsilon _4x_{4i}-\epsilon _3x_{3i}}{x_{4i}-x_{3i}}}}} \\ = {}&\sum _i{\frac{x_{2i}-x_{1i}}{x_{4i}-x_{3i}}\log {\frac{r_{2i-1}+\frac{\epsilon _1x_{1i}-\epsilon _2x_{2i}}{x_{1i}-x_{2i}}}{r_{2i-1}+\frac{\epsilon _4x_{4i}-\epsilon _3x_{3i}}{x_{4i}-x_{3i}}}}}. \end{aligned}$$
(21)

Without loss of generality, assume that the summand in (21) attains its minimum at \(i=1\). Since the sum has \(d\) terms,

$$KD \ge d\frac{x_{21}-x_{11}}{x_{41}-x_{31}}\log {\frac{r_{1}+\frac{\epsilon _1x_{11}-\epsilon _2x_{21}}{x_{11}-x_{21}}}{r_{1}+\frac{\epsilon _4x_{41}-\epsilon _3x_{31}}{x_{41}-x_{31}}}}$$

Let,

$$D_1=\frac{KD}{ d\frac{x_{21}-x_{11}}{x_{41}-x_{31}}}\ge \log {\frac{r_{1}+\frac{\epsilon _1x_{11}-\epsilon _2x_{21}}{x_{11}-x_{21}}}{r_{1}+\frac{\epsilon _4x_{41}-\epsilon _3x_{31}}{x_{41}-x_{31}}}}.$$

Hence,

$$e^{D_1} \ge \frac{r_{1}+\frac{\epsilon _1x_{11}-\epsilon _2x_{21}}{x_{11}-x_{21}}}{r_{1}+\frac{\epsilon _4x_{41}-\epsilon _3x_{31}}{x_{41}-x_{31}}} \ge \frac{r_1-\frac{\epsilon x_{21}}{x_{11}-x_{21}}}{r_1+\frac{\epsilon x_{41}}{x_{41}-x_{31}}}.$$

Finally,

$$KD \ge d\frac{x_{11}-x_{21}}{x_{41}-x_{31}}\log { \frac{r_1+\frac{\epsilon x_{41}}{x_{41}-x_{31}}}{r_1-\frac{\epsilon x_{21}}{x_{11}-x_{21}}}}.$$
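As a sanity check on the algebra, the definition \(-\sum _i P(i)\log (Q(i)/P(i))\) and the simplified form (21) can be compared numerically. This is a minimal sketch on made-up inputs: the four points, the multipliers `r_odd` (standing in for \(r_{2i-1}\)) and the `eps` tuple \((\epsilon _1,\epsilon _2,\epsilon _3,\epsilon _4)\) are hypothetical values chosen so that every ratio is positive and no denominator vanishes.

```python
import math

def kd_definition(r_odd, x1, x2, x3, x4, eps):
    """KD computed directly as -sum_i P(i) * log(Q(i) / P(i))."""
    e1, e2, e3, e4 = eps
    total = 0.0
    for r, a, b, c, d in zip(r_odd, x1, x2, x3, x4):
        p = (a - b) / (d - c)
        q = (r * (a - b) + e1 * a - e2 * b) / (r * (d - c) + e4 * d - e3 * c)
        total -= p * math.log(q / p)
    return total

def kd_closed_form(r_odd, x1, x2, x3, x4, eps):
    """KD computed from the second line of Eq. (21)."""
    e1, e2, e3, e4 = eps
    total = 0.0
    for r, a, b, c, d in zip(r_odd, x1, x2, x3, x4):
        num = r + (e1 * a - e2 * b) / (a - b)
        den = r + (e4 * d - e3 * c) / (d - c)
        total += (b - a) / (d - c) * math.log(num / den)
    return total

# Hypothetical inputs (two coordinates, so d = 2).
x1, x2, x3, x4 = (3.0, 4.0), (1.0, 1.0), (2.0, 2.0), (5.0, 7.0)
r_odd = (10.0, 12.0)
eps = (0.01, 0.02, 0.03, 0.04)

kd_a = kd_definition(r_odd, x1, x2, x3, x4, eps)
kd_b = kd_closed_form(r_odd, x1, x2, x3, x4, eps)
```

The two values agree up to floating-point rounding, which is all the sketch is meant to show; it says nothing about how tight the final lower bound is.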

B Range of Bit Length of the Parameters

The probability of correctly guessing the random numbers from Eq. (1) is computed as follows. The adversary may arbitrarily fix the choice of two indices from \(\{1,\ldots ,2d\}\) for the \(r_i\)s and the corresponding index from \(\{1,\ldots ,n\}\) for the choice of \(\epsilon \). Two of the \(2d\) values \(r_i\) can be fixed in \(\binom{2d}{2}\) ways. Similarly, one of the \(n\) values \(\epsilon _i\) can be chosen in \(n\) ways. Hence the probability is:

$$\binom{2d}{2}\binom{n}{1} \frac{1}{2^{2\ell _1}} \frac{1}{2^{\ell _2}}.$$

Similarly, the probability of correctly guessing from Eq. (13) is:

$$\binom{2d}{1}\binom{n}{3}\frac{1}{2^{\ell _1}} \frac{1}{2^{3\ell _2}}.$$

Fixing n and d as chosen, for the probability to be less than \(2^{-80}\), the following two inequalities must be satisfied,

$$\begin{aligned} 2\ell _1 +\ell _2 \ge 103, \end{aligned}$$
(22)

and

$$\begin{aligned} \ell _1+3\ell _2 \ge 130. \end{aligned}$$
(23)

Hence the above two inequalities give us the range for the bit lengths of the parameters.
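The way such thresholds arise can be sketched for arbitrary parameters. The constants 103 and 130 in (22) and (23) depend on the values of n and d chosen in the paper, which this appendix does not restate, so the example below uses hypothetical values (`n=1000`, `d=2`) and will generally produce different numbers. The function names are illustrative, and exact integer arithmetic is used so that no rounding is involved.

```python
import math
from fractions import Fraction

def guess_counts(n, d):
    """Numbers of index choices in the two guessing probabilities."""
    first = math.comb(2 * d, 2) * math.comb(n, 1)
    second = math.comb(2 * d, 1) * math.comb(n, 3)
    return first, second

def min_exponent_sums(n, d, security_bits=80):
    """Smallest integer values of 2*l1 + l2 and l1 + 3*l2 that push both
    probabilities strictly below 2**-security_bits (requires n >= 3, d >= 1)."""
    first, second = guess_counts(n, d)
    # For an integer c >= 1, the smallest t with 2**t > c is c.bit_length().
    return (security_bits + first.bit_length(),
            security_bits + second.bit_length())

def satisfies(n, d, l1, l2, security_bits=80):
    """Check both probabilities exactly for given bit lengths l1 and l2."""
    first, second = guess_counts(n, d)
    target = Fraction(1, 2 ** security_bits)
    p1 = Fraction(first, 2 ** (2 * l1 + l2))
    p2 = Fraction(second, 2 ** (l1 + 3 * l2))
    return p1 < target and p2 < target

# Hypothetical parameters, not the ones used in the paper.
bounds = min_exponent_sums(n=1000, d=2)
```

For these assumed parameters, `satisfies(1000, 2, 40, 30)` holds while `satisfies(1000, 2, 30, 20)` does not, which mirrors how (22) and (23) jointly constrain \(\ell _1\) and \(\ell _2\).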

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Ghosal, R., Chatterjee, S. (2018). Privacy Preserving Multi-server k-means Computation over Horizontally Partitioned Data. In: Ganapathy, V., Jaeger, T., Shyamasundar, R. (eds) Information Systems Security. ICISS 2018. Lecture Notes in Computer Science, vol 11281. Springer, Cham. https://doi.org/10.1007/978-3-030-05171-6_10

  • DOI: https://doi.org/10.1007/978-3-030-05171-6_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05170-9

  • Online ISBN: 978-3-030-05171-6

  • eBook Packages: Computer Science, Computer Science (R0)
