
Divisive clustering of high dimensional data streams


Abstract

Clustering streaming data is gaining importance as automatic data acquisition technologies are deployed in diverse applications. We propose a fully incremental projected divisive clustering method for high-dimensional data streams that is motivated by high density clustering. The method is capable of identifying clusters in arbitrary subspaces, estimating the number of clusters, and detecting changes in the data distribution which necessitate a revision of the model. The empirical evaluation of the proposed method on numerous real and simulated datasets shows that it is scalable in dimension and number of clusters, is robust to noisy and irrelevant features, and is capable of handling a variety of types of non-stationarity.



Acknowledgments

David Hofmeyr gratefully acknowledges funding from both the Engineering and Physical Sciences Research Council (EPSRC) and the Oppenheimer Memorial Trust.

Author information

Correspondence to David P. Hofmeyr.

Appendix 1: Proofs

Before we can prove Lemma 2, we require the following preliminaries.

The algorithm for computing the dip of a distribution function F constructs a unimodal distribution function G with the following properties: (i) the modal interval of G, \([m, M]\), is equal to the modal interval of the unimodal distribution function closest to F in the supremum norm, which we denote by \(F^U\); (ii) \(\Vert F - G\Vert _\infty = 2\Vert F - F^U\Vert _\infty \); (iii) G is the greatest convex minorant of F on \((-\infty , m]\); (iv) G is the least concave majorant of F on \([M, \infty )\). By construction, the function G is linear between its nodes. A node \(n \le m\) of G satisfies \(G(n) = \hbox {liminf}_{x \rightarrow n}F(x)\), while a node \(n \ge M\) of G satisfies \(G(n) = \hbox {limsup}_{x \rightarrow n}F(x)\). If F is the distribution function of a discrete random variable, then G is continuous.

The function \(F^U\) can be constructed by finding appropriate values \(b<m\), \(B>M\) s.t. \(F^U\) is equal to \(G+\hbox {Dip}(F)\) on \([b, m]\), equal to \(G-\hbox {Dip}(F)\) on \([M, B]\), interpolates linearly between G(m) and G(M), and is given appropriate tails, which we choose to decrease linearly to 0 and increase linearly to 1, respectively.

We also require the following preliminary result, which relies on the notion of a step linear function.

Definition 3

(Step Linear) A function f is step linear on a non-empty, compact interval \(I = [a, b]\), if

$$\begin{aligned} f(x) = \alpha + \beta \left\lfloor (x-a)\frac{n}{b-a}\right\rfloor , \quad \forall x \in I, \end{aligned}$$

for some \(\alpha , \beta \in \mathbb {R}\) and \(n \in {\mathbb {N}}\).

A step linear function is piecewise constant, with n jumps of size \(\beta \) spaced equally on I and the final jump occurring at b. The approximate empirical distribution function \(\tilde{F}\) (Sect. 4.2.2) is therefore step linear over the approximating intervals.
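
To illustrate Definition 3 numerically, the short Python sketch below (all helper names are our own, not from the paper, and the placement of the equally spaced atoms is chosen to match Definition 3 rather than copied from Sect. 4.2.2) evaluates a step linear function and checks that the empirical distribution function of a sample whose points in \([a, b]\) have been replaced by n equally spaced points is step linear on \([a, b]\) with \(\beta = 1/N\), where N is the total sample size.

```python
import numpy as np

def step_linear(x, a, b, alpha, beta, n):
    # f(x) = alpha + beta * floor((x - a) * n / (b - a)), x in [a, b]  (Definition 3)
    return alpha + beta * np.floor((x - a) * n / (b - a))

def ecdf(x, data):
    # right-continuous empirical distribution function of `data`, evaluated at x
    return np.searchsorted(np.sort(data), x, side="right") / data.size

rng = np.random.default_rng(0)
a, b, n = 0.0, 1.0, 5
inside = a + (b - a) * np.arange(1, n + 1) / n   # n equally spaced atoms, final one at b
outside = 2.0 + rng.random(20)                   # remainder of the sample, entirely above b
sample = np.concatenate([inside, outside])
N = sample.size

# evaluate away from the jump locations to avoid floating point ties
grid = a + (b - a) * (np.arange(50) + 0.5) / 50
alpha = ecdf(a, sample)                          # proportion of the sample at or below a
assert np.allclose(ecdf(grid, sample), step_linear(grid, a, b, alpha, 1.0 / N, n))
```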

Proposition 4

Let f be step linear on an interval \(I = [a, b]\) and satisfy \(\lim _{x \rightarrow a^-}f(x) = \alpha - \beta \), where \(\alpha , \beta \) are as in the above definition applied to f. Let g be linear on I and continuous on a neighbourhood of I. Then

$$\begin{aligned} \sup _{x \in I} \vert f(x) - g(x) \vert\le & {} \max \left\{ \hbox {limsup}_{x \rightarrow a}\vert f(x) - g(x) \vert ,\right. \\&\left. \hbox {limsup}_{x \rightarrow b} \vert f(x) - g(x) \vert \right\} . \end{aligned}$$

Proof

Let \(f_m\) and \(f^M\) be linear on a neighbourhood of I s.t. they form the closest lower and upper bounding functions of f on I respectively. Since f is step linear, we have,

$$\begin{aligned} \lim _{x \rightarrow a^-}f(x) = f_m(a),&\lim _{x \rightarrow b^-}f(x) = f_m(b),\\ f(a) = f^M(a),&f(b) = f^M(b). \end{aligned}$$

We therefore have, by above and the fact that \(g, f_m\), and \(f^M\) are linear on I,

$$\begin{aligned} \sup _{x \in I}\vert f(x) - g(x) \vert\le & {} \max \left\{ \sup _{x \in I}\vert f^M(x) - g(x)\vert ,\right. \\&\left. \sup _{x \in I}\vert f_m(x) - g(x) \vert \right\} \\= & {} \max \left\{ \vert f^M(b) - g(b)\vert ,\right. \\&\vert f^M(a) - g(a)\vert , \vert f_m(a) - g(a) \vert ,\\&\left. \vert f_m(b) - g(b)\vert \right\} \\= & {} \max \left\{ \hbox {limsup}_{x \rightarrow a}\vert f(x) - g(x) \vert ,\right. \\&\left. \hbox {limsup}_{x \rightarrow b}\vert f(x) - g(x) \vert \right\} . \end{aligned}$$

\(\square \)
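
Proposition 4 can be checked numerically on an example. The sketch below (arbitrary illustrative values, not taken from the paper) compares a step linear f with a linear g on a dense grid and confirms that the supremum of \(\vert f - g \vert \) on I does not exceed the larger of the two endpoint limsups, each computed from the two one-sided limits of f.

```python
import numpy as np

# Numerical check of Proposition 4 (illustrative values only)
a, b, n = 0.0, 1.0, 7
alpha, beta = 0.1, 0.05

def f(x):
    # step linear function of Definition 3
    return alpha + beta * np.floor((x - a) * n / (b - a))

def g(x):
    # an arbitrary linear function on a neighbourhood of [a, b]
    return 0.05 + 0.42 * x

grid = np.linspace(a, b, 100001)
lhs = np.max(np.abs(f(grid) - g(grid)))

# limsup of |f - g| at a and b: compare both one-sided limits of f with g,
# using lim_{x -> a^-} f(x) = alpha - beta as in the hypothesis
rhs = max(abs(alpha - beta - g(a)), abs(alpha - g(a)),
          abs(alpha + beta * (n - 1) - g(b)), abs(alpha + beta * n - g(b)))

assert lhs <= rhs + 1e-9
print(f"sup |f - g| = {lhs:.4f} <= {rhs:.4f}")
```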

We are now in a position to prove Lemma 2, which states that the dip of a compactly approximated sample, as described in Sect. 4.2.2, provides a lower bound on the dip of the true sample.

Proof of Lemma 2

Let \(I = [a, b]\) be any compact interval and \(F_I\) the empirical distribution function of \((\mathcal {X}\cap I^c) \cup \hbox {Unif}(\mathcal {X}, I)\). Assume \(\vert \mathcal {X}\cap I \vert >1\), since otherwise \(F_I = F_\mathcal {X}\) and we are done. We can assume that the endpoints of I are elements of \(\mathcal {X}\) since this defines the same uniform set. \(F_\mathcal {X}\) and \(F_I\) are therefore equal on \(\hbox {Int}(I)^c\). In fact, since \(\mathcal {X}\) consists of unique points, \(\exists \epsilon > 0\) s.t. \(F_I(x) = F_\mathcal {X}(x) \ \forall x \not \in (a+\epsilon , b-\epsilon )\). Define \(F^\prime _I\) to be equal to \(F_\mathcal {X}^U\) for \(x \not \in \hbox {Int}(I)\) and to interpolate linearly between \(F_\mathcal {X}^U(a)\) and \(F_\mathcal {X}^U(b)\) on \(\hbox {Int}(I)\). By construction, \(F^\prime _I\) is a continuous unimodal distribution function.

We now show \(\Vert F_I - F_I^\prime \Vert _\infty \le \Vert F_\mathcal {X}- F_\mathcal {X}^U\Vert _\infty \). To see this, suppose that it is not true, i.e., \(\exists x\) s.t. \(\vert F_I(x) - F_I^\prime (x)\vert > \sup _y \vert F_\mathcal {X}(y) - F_\mathcal {X}^U(y)\vert \). Clearly \(x \in \hbox {Int}(I)\) due to the equalities discussed above and the construction of \(F^\prime _I\). Because of the continuity of \(F_\mathcal {X}^U\) and \(F_I^\prime \) and the equality of \(F_\mathcal {X}\) and \(F_I\) on \((a, a+\epsilon ) \cup (b-\epsilon , b)\), we have

$$\begin{aligned} \hbox {limsup}_{y \rightarrow a} \vert F_I(y) - F^\prime _I(y) \vert = \hbox {limsup}_{y \rightarrow a}\vert F_\mathcal {X}(y) - F^U_\mathcal {X}(y) \vert \end{aligned}$$

and

$$\begin{aligned} \hbox {limsup}_{y \rightarrow b} \vert F_I(y) - F^\prime _I(y) \vert = \hbox {limsup}_{y \rightarrow b}\vert F_\mathcal {X}(y) - F^U_\mathcal {X}(y) \vert . \end{aligned}$$

But by Proposition 4 one of these left hand sides is at least as large as \(\vert F_I(x) - F_I^\prime (x) \vert \), leading to a contradiction.

We have shown that the addition of a single interval cannot increase the dip. We can apply the same logic to the now modified sample \((\mathcal {X}\cap I^c) \cup \hbox {Unif}(\mathcal {X}, I)\), iterating the addition of disjoint intervals to obtain a non-increasing sequence of dips. \(\square \)
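
As a sanity check of Lemma 2, one can compare the dip of a sample with the dip of its compact approximation on an interval. The sketch below assumes that \(\hbox {Unif}(\mathcal {X}, I)\) replaces the points of \(\mathcal {X}\) falling in I with the same number of equally spaced points spanning their range (our reading of the construction in Sect. 4.2.2), and relies on the third-party Python package diptest, whose dipstat function we assume returns Hartigan's dip statistic.

```python
import numpy as np
import diptest  # third-party package; we assume diptest.dipstat(x) returns the dip of x

rng = np.random.default_rng(1)
# a clearly bimodal sample
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(4.0, 1.0, 500)])

def approximate_on_interval(x, a, b):
    """Replace the points of x lying in [a, b] with the same number of equally
    spaced points spanning their range (our reading of Unif(X, I))."""
    inside = (x >= a) & (x <= b)
    pts = np.sort(x[inside])
    unif = np.linspace(pts[0], pts[-1], pts.size)
    return np.concatenate([x[~inside], unif])

x_tilde = approximate_on_interval(x, 1.5, 2.5)   # approximate over the low-density valley
dip_x, dip_tilde = diptest.dipstat(np.sort(x)), diptest.dipstat(np.sort(x_tilde))
print(dip_x, dip_tilde)
assert dip_tilde <= dip_x + 1e-10   # Lemma 2: the approximation cannot increase the dip
```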

In the proof of Lemma 2 above, we do not show that \(F_I^\prime \) is the closest unimodal distribution function to \(F_I\); however, its existence ensures that the closest unimodal distribution function is at least as close. Now, the sample approximations we employ still contain all t atoms after t observations, but they can be stored in \({\mathcal {O}}(k)\) space when k intervals are used. The following proposition shows that the dip of such a sample approximation can also be computed in \({\mathcal {O}}(k)\) time.

Proposition 5

The dip of a sample consisting of k uniform sets with disjoint ranges can be computed in \({\mathcal {O}}(k)\) time.

Proof

We begin by showing that there exists a unimodal distribution function which is linear on the ranges of the uniform sets and which achieves the minimal distance to the empirical distribution function of the sample.

Let F be a continuous unimodal distribution function s.t. \(\Vert F - \tilde{F}\Vert _\infty = \hbox {Dip}(\tilde{F})\). Define \(F^\prime \), analogously to \(F^\prime _I\) in the proof of Lemma 2, to be the continuous distribution function which is equal to F outside and at the boundaries of the intervals defining the uniform sets, and which interpolates linearly on them. Using the same logic, we know that \(\sup _{x}\vert F^\prime (x) - \tilde{F}(x) \vert \le \sup _x\vert F(x) - \tilde{F}(x)\vert \), hence \(\Vert F^\prime - \tilde{F}\Vert _\infty = \hbox {Dip}(\tilde{F})\).

Proposition 4 ensures that points in the interior of the intervals will not be chosen by the dip algorithm as end points of the modal interval of G, nor as points at which the difference between the functions is supremal. The number of possible choices for these locations is therefore \({\mathcal {O}}(k)\), and the algorithm need not evaluate the functions except at the endpoints of the intervals. \(\square \)
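
The following sketch illustrates the data structure implied by Proposition 5: k uniform sets with disjoint ranges are stored as (left, right, count) triples, and the approximate distribution function is only ever evaluated at the 2k range endpoints, the candidate locations identified in the proof. The names and the exact representation are ours; the paper's implementation may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UniformSet:
    left: float   # left endpoint of the set's range
    right: float  # right endpoint of the set's range
    count: int    # number of equally spaced atoms the set summarises

def endpoint_evaluations(sets: List[UniformSet]) -> List[Tuple[float, float]]:
    """Evaluate the approximate ECDF at the 2k range endpoints: by Proposition 4
    these are the only candidate nodes of G and the only candidate locations of
    the supremal difference, so the dip requires O(k) evaluations."""
    sets = sorted(sets, key=lambda s: s.left)   # disjoint ranges admit a total order
    total = sum(s.count for s in sets)
    out, seen = [], 0
    for s in sets:
        out.append((s.left, seen / total))      # value of the ECDF just below the range
        seen += s.count
        out.append((s.right, seen / total))     # value of the ECDF at the right endpoint
    return out

# e.g. three uniform sets summarising 10, 5 and 25 atoms respectively
print(endpoint_evaluations([UniformSet(0.0, 1.0, 10),
                            UniformSet(2.0, 2.5, 5),
                            UniformSet(4.0, 6.0, 25)]))
```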

Finally, we provide a proof of Proposition 3.

Proof of Proposition 3

For \(s > 1\) we have \(\Vert v_s - v_{s-1}\Vert = \Vert v_s\Vert \Vert v_s-v_{s-1}\Vert \ge \vert v_s \cdot (v_s - v_{s-1})\vert = \vert v_s \cdot v_{s-1}-1\vert \), by the Cauchy–Schwarz inequality and the fact that \(\Vert v_t\Vert = 1 \ \forall t\). Therefore, since \(\{v_t\}_{t=1}^\infty \) is almost surely convergent, and therefore almost surely Cauchy, we have \(v_s \cdot v_{s-1} \xrightarrow {a.s.} 1 \Rightarrow \arccos (v_s \cdot v_{s-1}) \xrightarrow {a.s.}0\). Now, we can easily show that,

$$\begin{aligned} \lambda _t \le \gamma ^{t-1}\lambda _1 + (1-\gamma )\sum _{i=1}^{t-2}\gamma ^i\arccos (v_{t-i}\cdot v_{t-i-1}). \end{aligned}$$

Take \(\epsilon > 0\) and t large enough that \(\gamma ^{t-1}\lambda _1 < \gamma \epsilon \), and \(t>k+2\), where \(k = \lfloor \log (\epsilon (1-\gamma )/2\pi )/\log (\gamma )-1\rfloor \). Consider,

$$\begin{aligned} \sum _{i=1}^{t-2}\gamma ^i\arccos (v_{t-i}\cdot v_{t-i-1})\le & {} \sum _{i=1}^{k}\arccos (v_{t-i}\cdot v_{t-i-1})\\&+\frac{\pi \gamma ^{k+1}}{1-\gamma }, \end{aligned}$$

and \(\frac{\pi \gamma ^{k+1}}{1-\gamma } \le \frac{\epsilon }{2}\). In all,

$$\begin{aligned} \lambda _t > \epsilon \Rightarrow \sum _{i=0}^k\arccos (v_{t-i}\cdot v_{t-i-1}) > \epsilon /2. \end{aligned}$$

Notice that k does not depend on t. With probability 1, for any given \(\epsilon >0\) there is a \({\mathcal {T}}\) s.t. \(T>{\mathcal {T}}\) implies \(\sum _{i=0}^k\arccos (v_{T-i}\cdot v_{T-i-1})\le \epsilon /2\), implying \(\lambda _T \le \epsilon \) for all \(T > {\mathcal {T}}\), and the result follows. \(\square \)
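
To see the mechanics behind Proposition 3, the following simulation assumes (our assumption, consistent with the bound used in the proof but not stated in this excerpt) that \(\lambda _t\) is updated by the exponentially weighted recursion \(\lambda _t = \gamma \lambda _{t-1} + (1-\gamma )\arccos (v_t \cdot v_{t-1})\). When the direction estimates \(v_t\) converge, the angles between consecutive estimates vanish and \(\lambda _t\) is driven towards zero.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, gamma = 10, 5000, 0.99
target = np.zeros(d)
target[0] = 1.0           # the direction to which the estimates v_t converge

lam = np.pi / 2           # arbitrary initial value lambda_1
v_prev = target.copy()
for t in range(2, T + 1):
    v = target + rng.normal(scale=1.0 / t, size=d)   # estimates converging to `target`
    v /= np.linalg.norm(v)
    angle = np.arccos(np.clip(v @ v_prev, -1.0, 1.0))
    lam = gamma * lam + (1 - gamma) * angle          # assumed update for lambda_t
    v_prev = v

print(f"lambda_T after T = {T} steps: {lam:.2e}")    # small, and shrinking as T grows
```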


Cite this article

Hofmeyr, D.P., Pavlidis, N.G. & Eckley, I.A. Divisive clustering of high dimensional data streams. Stat Comput 26, 1101–1120 (2016). https://doi.org/10.1007/s11222-015-9597-y
