Privacy Preserving Multi-server k-means Computation over Horizontally Partitioned Data

Ghosal, Riddhi; Chatterjee, Sanjit

doi:10.1007/978-3-030-05171-6_10


Part of the book series: Lecture Notes in Computer Science (LNSC, volume 11281)

Included in the following conference series:

International Conference on Information Systems Security


Abstract

k-means clustering is one of the most popular clustering algorithms in data mining. Recently, much research has focused on the algorithm when the data set is divided among multiple parties or is too large to be handled by the data owner. In the latter case, servers are usually hired to perform the clustering. The data owner divides the data set among the servers, which together compute the k-means and return the cluster labels to the owner. The major challenge in this method is to prevent the servers from gaining substantial information about the owner's actual data. Several algorithms have been designed in the past that provide cryptographic solutions for privacy preserving k-means. We propose a new method to perform k-means over a large data set using multiple servers. Our technique avoids heavy cryptographic computations and instead uses a simple randomization technique to preserve the privacy of the data. The k-means computed has essentially the same efficiency and accuracy as the k-means computed over the original data set without any randomization. We argue that our algorithm is secure against honest-but-curious and non-colluding adversaries.


Notes

1. We are aware of several other methods for selecting the initial centers that may make k-means work more efficiently. In this work, however, we do not focus on the choice of initial cluster centers.


Author information

Authors and Affiliations

Indian Statistical Institute, Kolkata, India
Riddhi Ghosal
Department of Computer Science and Automation, Indian Institute of Science, Bengaluru, India
Sanjit Chatterjee


Corresponding authors

Correspondence to Riddhi Ghosal or Sanjit Chatterjee.

Editor information

Editors and Affiliations

Indian Institute of Science, Bangalore, India
Vinod Ganapathy
Pennsylvania State University, University Park, PA, USA
Trent Jaeger
Indian Institute of Technology Bombay, Mumbai, India
R.K. Shyamasundar

Appendices

A Detailed Computations

A.1 Lower Bound for $r_i$, $1 \le i \le 2d$

$$\begin{aligned}&\max \sqrt{\sum _{i=1}^d\left(r_{2i-1}^2(x_{1i}-x_{2i})^2+(\epsilon _1x_{1i}-\epsilon _2x_{2i})^2+2r_{2i-1}(x_{1i}-x_{2i})(\epsilon _1x_{1i}-\epsilon _2x_{2i})\right)} \\ <{}& \min \sqrt{\sum _{i=1}^d\left(r_{2i-1}^2(x_{1i}-x_{3i})^2+(\epsilon _1x_{1i}-\epsilon _3x_{3i})^2+2r_{2i-1}(x_{1i}-x_{3i})(\epsilon _1x_{1i}-\epsilon _3x_{3i})\right)} \end{aligned}$$

To maximize the LHS and minimize the RHS, we take $\epsilon _1=\epsilon $, $\epsilon _2=0$, $\epsilon _3=\epsilon $, and thus,

$$\begin{aligned}&\sum _{i=1}^d\left(r_{2i-1}^2(x_{1i}-x_{2i})^2+\epsilon ^2x_{1i}^2+2r_{2i-1}(x_{1i}-x_{2i})(\epsilon x_{1i})\right)\\ <{}& \sum _{i=1}^d\left(r_{2i-1}^2(x_{1i}-x_{3i})^2+\epsilon ^2(x_{1i}-x_{3i})^2+2r_{2i-1} \epsilon (x_{1i}-x_{3i})^2\right) \end{aligned}$$

Using (10), we further get,

$$\begin{aligned}&\epsilon \sum _{i=1}^d\left[x_{1i}^2-(x_{1i}-x_{3i})^2\right]+2 \sum _{i=1}^d r_{2i-1}\left[x_{1i}(x_{1i}-x_{2i})-(x_{1i}-x_{3i})^2\right]<0 \\ \Rightarrow {}& 2 \sum _{i=1}^d r_{2i-1}\left[(x_{1i}-x_{3i})^2-x_{1i}(x_{1i}-x_{2i})\right]>\epsilon \sum _{i=1}^d\left[x_{1i}^2-(x_{1i}-x_{3i})^2\right] \\ \Rightarrow {}& r> \max \left(\epsilon \frac{\sum _{l=1}^d\left(x_{il}^2-(x_{il}-x_{kl})^2\right)}{2 \sum _{l=1}^d\left[(x_{il}-x_{kl})^2-x_{il}(x_{il}-x_{jl})\right]}\right), \quad \forall i, j, k. \end{aligned}$$

(20)

Given that $\epsilon $ is sufficiently small, it can be safely assumed to be less than 1. Hence if (12) is satisfied, (20) is satisfied as well. Although (20) is a better bound, we use (12) to make it independent of $\epsilon $.
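The quantity compared on both sides of the first inequality in A.1 can be evaluated numerically. The following is a minimal sketch, not the authors' implementation: the function names and the sample points are hypothetical, and `r_odd` is assumed to hold the values $r_{2i-1}$ for $i = 1, \ldots, d$. It computes the square root of the expanded sum exactly as written in that inequality and checks whether the nearer point stays nearer.

```python
import math

def perturbed_distance(xa, xb, r_odd, eps_a, eps_b):
    """Square root of the sum on either side of the first inequality in A.1.

    xa, xb  : coordinates of two data points (length d)
    r_odd   : assumed to hold r_{2i-1} for i = 1..d
    eps_a/b : the epsilon values attached to the two points
    """
    total = 0.0
    for a, b, r in zip(xa, xb, r_odd):
        diff = a - b                      # x_{ai} - x_{bi}
        noise = eps_a * a - eps_b * b     # eps_a * x_{ai} - eps_b * x_{bi}
        total += r * r * diff * diff + noise * noise + 2 * r * diff * noise
    return math.sqrt(total)

def order_preserved(x1, x2, x3, r_odd, eps1, eps2, eps3):
    """True if x2 remains closer to x1 than x3 does under the perturbed distance."""
    d12 = perturbed_distance(x1, x2, r_odd, eps1, eps2)
    d13 = perturbed_distance(x1, x3, r_odd, eps1, eps3)
    return d12 < d13

# Hypothetical sample: x2 is the nearer neighbour of x1 in the original data.
print(order_preserved([1, 1], [2, 1], [4, 5], [10, 10], 0.01, 0.0, 0.01))
```

With all $\epsilon$ values set to zero and a common $r$, the function reduces to $r$ times the Euclidean distance, which gives a quick sanity check of the expansion.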

A.2 Kullback Leibler Distance

The Kullback Leibler Distance (KD) is defined to be $-\sum _i{P(i)\log {\frac{Q(i)}{P(i)}}}$ where

$$P(i)=\frac{x_{1i}-x_{2i}}{x_{4i}-x_{3i}}~~~\text{ and } Q(i)=\frac{r_{2i-1}(x_{1i}-x_{2i})+\epsilon _1x_{1i}-\epsilon _2x_{2i}}{r_{2i-1}(x_{4i}-x_{3i})+\epsilon _4x_{4i}-\epsilon _3x_{3i}}.$$

$$\begin{aligned} KD= & {} -\sum _i{\frac{x_{1i}-x_{2i}}{x_{4i}-x_{3i}}\log {\frac{r_{2i-1}+\frac{\epsilon _1x_{1i}-\epsilon _2x_{2i}}{x_{1i}-x_{2i}}}{r_{2i-1}+\frac{\epsilon _4x_{4i}-\epsilon _3x_{3i}}{x_{4i}-x_{3i}}}}} \nonumber \\= & {} \sum _i{\frac{x_{2i}-x_{1i}}{x_{4i}-x_{3i}}\log {\frac{r_{2i-1}+\frac{\epsilon _1x_{1i}-\epsilon _2x_{2i}}{x_{1i}-x_{2i}}}{r_{2i-1}+\frac{\epsilon _4x_{4i}-\epsilon _3x_{3i}}{x_{4i}-x_{3i}}}}}. \end{aligned}$$

(21)

Without loss of generality, we assume that the summand in the above expression attains its minimum at $i=1$, so that

$$KD \ge d\frac{x_{21}-x_{11}}{x_{41}-x_{31}}\log {\frac{r_{1}+\frac{\epsilon _1x_{11}-\epsilon _2x_{21}}{x_{11}-x_{21}}}{r_{1}+\frac{\epsilon _4x_{41}-\epsilon _3x_{31}}{x_{41}-x_{31}}}}$$

Let,

$$D_1=\frac{KD}{ d\frac{x_{21}-x_{11}}{x_{41}-x_{31}}}\ge \log {\frac{r_{1}+\frac{\epsilon _1x_{11}-\epsilon _2x_{21}}{x_{11}-x_{21}}}{r_{1}+\frac{\epsilon _4x_{41}-\epsilon _3x_{31}}{x_{41}-x_{31}}}}.$$

Hence,

$$e^{D_1} \ge \frac{r_{1}+\frac{\epsilon _1x_{11}-\epsilon _2x_{21}}{x_{11}-x_{21}}}{r_{1}+\frac{\epsilon _4x_{41}-\epsilon _3x_{31}}{x_{41}-x_{31}}} \ge \frac{r_1-\frac{\epsilon x_{21}}{x_{11}-x_{21}}}{r_1+\frac{\epsilon x_{41}}{x_{41}-x_{31}}}.$$

Finally,

$$KD \ge d\frac{x_{11}-x_{21}}{x_{41}-x_{31}}\log { \frac{r_1+\frac{\epsilon x_{41}}{x_{41}-x_{31}}}{r_1-\frac{\epsilon x_{21}}{x_{11}-x_{21}}}}.$$
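As a numerical cross-check of the algebra in A.2, the sketch below evaluates KD once from the definition $-\sum_i P(i)\log\frac{Q(i)}{P(i)}$ and once from the simplified form (21). It is an illustration under stated assumptions, not the authors' code: the function names and sample values are hypothetical, `r_odd` is assumed to hold $r_{2i-1}$ for $i = 1, \ldots, d$, and the inputs are assumed to keep every ratio inside the logarithm positive.

```python
import math

def kd_from_definition(x1, x2, x3, x4, r_odd, eps):
    """KD = -sum_i P(i) * log(Q(i) / P(i)) with P and Q as defined in A.2."""
    e1, e2, e3, e4 = eps
    total = 0.0
    for a1, a2, a3, a4, r in zip(x1, x2, x3, x4, r_odd):
        p = (a1 - a2) / (a4 - a3)
        q = (r * (a1 - a2) + e1 * a1 - e2 * a2) / (r * (a4 - a3) + e4 * a4 - e3 * a3)
        total -= p * math.log(q / p)
    return total

def kd_from_eq21(x1, x2, x3, x4, r_odd, eps):
    """Same quantity written in the simplified form of Eq. (21)."""
    e1, e2, e3, e4 = eps
    total = 0.0
    for a1, a2, a3, a4, r in zip(x1, x2, x3, x4, r_odd):
        top = r + (e1 * a1 - e2 * a2) / (a1 - a2)
        bottom = r + (e4 * a4 - e3 * a3) / (a4 - a3)
        total += (a2 - a1) / (a4 - a3) * math.log(top / bottom)
    return total

# Hypothetical sample values chosen only so that every ratio is positive.
args = ([5, 6], [2, 1], [1, 2], [4, 7], [10, 12], (0.01, 0.02, 0.03, 0.04))
print(kd_from_definition(*args), kd_from_eq21(*args))
```

When all $\epsilon$ values are zero, $Q(i) = P(i)$ and both functions return zero, which matches the intuition that KD measures the distortion introduced by the $\epsilon$ terms.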

B Range of Bit Length of the Parameters

The probability of correctly guessing the random numbers from Eq. (1) is computed as follows. The adversary may arbitrarily fix the choice of two indices from $\{1,\ldots ,2d\}$ for the $r_i$'s and the corresponding index from $\{1,\ldots ,n\}$ for the choice of $\epsilon $. Fixing two of the $2d$ values $r_i$ can be done in $\binom{2d}{2}$ ways. Similarly, choosing one of the $n$ values $\epsilon _i$ can be done in $n$ ways. Hence the probability is:

$$\binom{2d}{2}\binom{n}{1} \frac{1}{2^{2\ell _1}} \frac{1}{2^{\ell _2}}.$$

Similarly, the probability of correctly guessing from Eq. (13) is:

$$\binom{2d}{1}\binom{n}{3}\frac{1}{2^{\ell _1}} \frac{1}{2^{3\ell _2}}.$$

Fixing n and d as chosen, for the probability to be less than $2^{-80}$, the following two inequalities must be satisfied:

$$\begin{aligned} 2\ell _1 +\ell _2 \ge 103, \end{aligned}$$

(22)

and

$$\begin{aligned} \ell _1+3\ell _2 \ge 130. \end{aligned}$$

(23)

Hence the above two inequalities give the range for the bit lengths of the parameters.
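The two guessing probabilities above are easy to evaluate in the log domain. The following is a minimal sketch, assuming the two probability expressions exactly as stated in this appendix; the function names are hypothetical, and the values of n and d used in the example are illustrative only, since the values chosen in the paper are not restated here.

```python
import math

def log2_guess_prob_eq1(n, d, l1, l2):
    """log2 of C(2d, 2) * C(n, 1) * 2^(-2*l1) * 2^(-l2)."""
    return math.log2(math.comb(2 * d, 2)) + math.log2(n) - 2 * l1 - l2

def log2_guess_prob_eq13(n, d, l1, l2):
    """log2 of C(2d, 1) * C(n, 3) * 2^(-l1) * 2^(-3*l2)."""
    return math.log2(2 * d) + math.log2(math.comb(n, 3)) - l1 - 3 * l2

def bit_lengths_ok(n, d, l1, l2, security_bits=80):
    """True if both guessing probabilities are below 2^(-security_bits)."""
    return (log2_guess_prob_eq1(n, d, l1, l2) < -security_bits
            and log2_guess_prob_eq13(n, d, l1, l2) < -security_bits)

# Illustrative values only (not the paper's n and d).
print(bit_lengths_ok(n=1000, d=10, l1=40, l2=40))
print(bit_lengths_ok(n=1000, d=10, l1=30, l2=30))
```

Working with base-2 logarithms avoids underflow for large bit lengths and makes the link to linear constraints of the form (22) and (23) direct: the combinatorial factors contribute a constant offset that depends only on n and d.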


Ghosal, R., Chatterjee, S. (2018). Privacy Preserving Multi-server k-means Computation over Horizontally Partitioned Data. In: Ganapathy, V., Jaeger, T., Shyamasundar, R. (eds) Information Systems Security. ICISS 2018. Lecture Notes in Computer Science(), vol 11281. Springer, Cham. https://doi.org/10.1007/978-3-030-05171-6_10


DOI: https://doi.org/10.1007/978-3-030-05171-6_10
Published: 05 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05170-9
Online ISBN: 978-3-030-05171-6
eBook Packages: Computer Science, Computer Science (R0)
