A novel DBSCAN with entropy and probability for mixed data

Cluster Computing

Abstract

In big data settings, detecting clusters of different sizes and shapes is a challenging and imperative task. Over the last several years, density-based clustering approaches have been widely used in many areas of science because of their simplicity and their ability to detect clusters of different sizes and shapes. Using diverse conversions of categorical data, a modified version of the DBSCAN algorithm is proposed for clustering mixed data, denoted the density-based clustering algorithm for mixed data with integration of entropy and probability distribution (EPDCA). Several optional conversions are provided so that the clustering process can adapt to the data. Benchmark data sets from UCI were selected to test the capability and validity of EPDCA. The results show that EPDCA considerably improves clustering, especially in the automatically determined number of clusters, noise discovery, and the time elapsed to form clusters.
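The abstract does not specify how EPDCA converts categorical attributes, so the following is only a minimal sketch of the general idea of density-based clustering on mixed data, not the authors' algorithm. It assumes a hypothetical distance in which numeric attributes are min-max scaled and each categorical mismatch is weighted by the normalised Shannon entropy of that attribute. The names `make_mixed_distance`, `dbscan`, `eps`, and `min_pts`, and the toy data, are illustrative assumptions.

```python
import math
from collections import Counter

NOISE = -1


def shannon_entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def make_mixed_distance(rows, num_idx, cat_idx):
    """Build a distance over mixed rows (assumed weighting, not EPDCA's)."""
    # Range of each numeric attribute, used for min-max scaling.
    spans = {}
    for j in num_idx:
        col = [r[j] for r in rows]
        spans[j] = (max(col) - min(col)) or 1.0
    # Entropy of each categorical attribute, normalised to [0, 1].
    weights = {}
    for j in cat_idx:
        col = [r[j] for r in rows]
        k = len(set(col))
        weights[j] = shannon_entropy(col) / math.log2(k) if k > 1 else 0.0

    def dist(a, b):
        d = sum(abs(a[j] - b[j]) / spans[j] for j in num_idx)
        d += sum(weights[j] for j in cat_idx if a[j] != b[j])
        return d

    return dist


def dbscan(rows, dist, eps, min_pts):
    """Plain DBSCAN with a brute-force neighbourhood search."""
    labels = [None] * len(rows)

    def neighbours(i):
        return [j for j in range(len(rows)) if dist(rows[i], rows[j]) <= eps]

    cluster = -1
    for i in range(len(rows)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


# Toy mixed data: one numeric and one categorical attribute per row.
rows = [
    (1.0, "a"), (1.1, "a"), (1.2, "a"),
    (5.0, "b"), (5.1, "b"), (5.2, "b"),
    (9.0, "a"),
]
dist = make_mixed_distance(rows, num_idx=[0], cat_idx=[1])
labels = dbscan(rows, dist, eps=0.1, min_pts=3)
print(labels)
```

As in standard DBSCAN, the number of clusters is not supplied in advance: it follows from `eps` and `min_pts`, and points that are not density-reachable from any core point are labelled as noise.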

Acknowledgements

The authors are very grateful to the editors and reviewers for their valuable comments and suggestions. This work was supported in part by the Major Projects of the National Social Science Fund of China (Grant Nos. 16ZDA045 and 15ZDB168), the National Natural Science Foundation of China (Grant Nos. 71603197, 71371148, and 91024020), the Junior Fellowships for CAST Advanced Innovation Think-tank Program (DXB-ZKQN-2016-013), and the China Postdoctoral Science Foundation Funded Project (2016M592403).

Author information

Corresponding author

Correspondence to Qing Yang.

About this article

Cite this article

Liu, X., Yang, Q. & He, L. A novel DBSCAN with entropy and probability for mixed data. Cluster Comput 20, 1313–1323 (2017). https://doi.org/10.1007/s10586-017-0818-3

  • DOI: https://doi.org/10.1007/s10586-017-0818-3
