Skip to main content
Log in

An adaptive version of k-medoids to deal with the uncertainty in clustering heterogeneous data using an intermediary fusion approach

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

This paper introduces Hk-medoids, a modified version of the standard k-medoids algorithm. The modification extends the algorithm for the problem of clustering complex heterogeneous objects that are described by a diversity of data types, e.g. text, images, structured data and time series. We first proposed an intermediary fusion approach to calculate fused similarities between objects, SMF, taking into account the similarities between the component elements of the objects using appropriate similarity measures. The fused approach entails uncertainty for incomplete objects or for objects which have diverging distances according to the different component. Our implementation of Hk-medoids proposed here works with the fused distances and deals with the uncertainty in the fusion process. We experimentally evaluate the potential of our proposed algorithm using five datasets with different combinations of data types that define the objects. Our results show the feasibility of the our algorithm, and also they show a performance enhancement when comparing to the application of the original SMF approach in combination with a standard k-medoids that does not take uncertainty into account. In addition, from a theoretical point of view, our proposed algorithm has lower computation complexity than the popular PAM implementation.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2

Similar content being viewed by others

References

  1. Abidi MA, Gonzalez RC (1992) Data fusion in robotics and machine intelligence. Academic Press Professional Inc, San Diego

    MATH  Google Scholar 

  2. Acar E, Rasmussen MA, Savorani F, Naes T, Bro R (2013) Understanding data fusion within the framework of coupled matrix and tensor factorizations. Chemom Intell Lab Syst 129:53–63 Multiway and Multiset Methods

    Article  Google Scholar 

  3. Akeem OA, Ogunyinka TK, Abimbola BL (2012) A framework for multimedia data mining in information technology environment. Int J Comput Sci Inf Secur (IJCSIS) 10(5):69–77

    Google Scholar 

  4. Baeza-Yates RA, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley Longman Publishing Co. Inc., Boston

    Google Scholar 

  5. Berndt DJ, Clifford J (1996) Finding patterns in time series: a dynamic programming approach. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. American Association for Artificial Intelligence, Menlo Park, pp 229–248

    Google Scholar 

  6. Bettencourt-Silva J, Iglesia B, Donell S, and Rayward-Smith V (2011) On creating a patient-centric database from multiple hospital information systems in a national health service secondary care setting. Methods Inf Med, 51(3):6730–6737

  7. Bie TD, Tranchevent L-C, van Oeffelen LMM, Moreau Y (2007) Kernel-based data fusion for gene prioritization. In: ISMB/ECCB (Supplement of Bioinformatics), pp 125–132

  8. Boström H, Andler SF, Brohede M, Johansson R, Karlsson A, van Laere J, Niklasson L, Nilsson M, Persson A, Ziemke T (2007) On the definition of information fusion as a field of research. Technical report, Institutionen för kommunikation och information

  9. Chan TY, Partin AW, Walsh PC, Epstein JI (2000) Prognostic significance of Gleason score 3+4 versus Gleason score 4+3 tumor at radical prostatectomy. Urology 56(5):823–827

    Article  Google Scholar 

  10. Dasarathy BV (2003) Information fusion, data mining, and knowledge discovery. Inf Fusion 4(1):1–2

    Article  Google Scholar 

  11. Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’01, pp 269–274, New York, NY, USA. ACM

  12. Dhillon IS, Mallela S, Modha D (2003) Information-theoretic co-clustering. In: Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, pp 89–98

  13. Dice LR (1945) Measures of the amount of ecologic association between species. Ecology 26:297–302

    Article  Google Scholar 

  14. Dimitriadou E, Weingessel A, Hornik K (2002) A combination scheme for fuzzy clustering. In: Pal N, Sugeno M (eds) Advances in soft computing (AFSS 2002), vol 2275., Lecture notes in computer science, Berlin Heidelberg, Springer, pp 332–338

  15. Faouzi N-EE, Leung H, Kurian A (2011) Data fusion in intelligent transportation systems: progress and challenges a survey. Inf Fusion 12(1):4–10 Special Issue on Intelligent Transportation Systems

    Article  Google Scholar 

  16. Gao B, Liu T, Zheng X, Cheng Q, Ma W (2006) Consistent bipartite graph co-partitioning for star structured high-order heterogeneous data co-clustering. In: Proceedings of the 6th IEEE international conference on data mining (ICDM), pp 1–31

  17. Google (2015) Explore trends. http://www.google.com/trends/?hl=en-GB. Accessed 04 April 2015

  18. Greene P, Cunningham P (2009) A matrix factorization approach for integrating multiple data views. In: Proceedings of the European conference on machine learning and knowledge discovery in databases: part I, pp 423–438

  19. Hall D, Llinas J (1997) An introduction to multisensor data fusion. Proc IEEE 85(1):6–23

    Article  Google Scholar 

  20. Hays J, Efros AA (2007) Scene completion using millions of photographs. In: ACM SIGGRAPH, (2007) papers, SIGGRAPH ’07, New York, NY, USA. ACM

  21. Huang A (2008) Similarity measures for text document clustering. In: Holland J, Nicholas A, Brignoli D (eds) New Zealand computer science research student conference, pp 49–56

  22. Inc. F (2015) The world’s most powerful celebrities. http://www.forbes.com/. Accessed 24 April 2015

  23. Jaccard S (1908) Nouvelles researches sur la distribution florale. Bull Soc Vaud Sci Nat 44:223–270

    Google Scholar 

  24. Kaufman L, Rousseeuw PJ (1987) Clustering by means of medoids. In: Dodge Y (ed) Statistical data analysis based on the L1-norm and related methods. Springer, Berlin Heidelberg, pp 405–416

    Google Scholar 

  25. Kaufman L, Rousseeuw PJ (1990) Finding groups in data, an introduction to cluster analysis. Wiley, New York

    MATH  Google Scholar 

  26. Khaleghi B, Khamis A, Karray FO, Razavi SN (2013) Multisensor data fusion: a review of the state-of-the-art. Inf Fusion 14(1):28–44

    Article  Google Scholar 

  27. Lanckriet GRG, Bie TD, Cristianini N, Jordan MI, Noble WS (2004a) A statistical framework for genomic data fusion. Bioinformatics 20(16):2626–2635

    Article  Google Scholar 

  28. Lanckriet GRG, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI (2004b) Learning the kernel matrix with semidefinite programming. J Mach Learn Res 5:27–72

    MathSciNet  MATH  Google Scholar 

  29. Laney D (2001) 3D data management: controlling data volume, velocity, and variety. Technical report, META Group

  30. Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’99, pp 16–22, New York, NY, USA. ACM

  31. Li X, Wu C, Zach C, Lazebnik S, Frahm J-M (2008) Modeling and recognition of landmark image collections using iconic scene graphs. In: Proceedings of the 10th European conference on computer vision: part I, ECCV ’08, pp 427–440, Springer-Verlag, Berlin, Heidelberg

  32. Liang P, Klein D (2009) Online EM for unsupervised models. In: Proceedings of human language technologies: the 2009 annual conference of the North American chapter of the Association for Computational Linguistics, NAACL ’09, pp 611–619, Stroudsburg, PA, USA

  33. Long B, Zhang Z, Wu X, Yu PS (2006) Spectral clustering for multi-type relational data. In: ICML, pp 585–592

  34. Ma H, Yang H, Lyu MR, King I (2008) Sorec: social recommendation using probabilistic matrix factorization. In: Proceedings of the 17th ACM conference on information and knowledge management, CIKM ’08, pp 931–940, New York, NY, USA. ACM

  35. Manjunath TN, Hegadi RS, Ravikumar GK (2010) A survey on multimedia data mining and its relevance today. Int J Comput Sci Netw Secur (IJCSNS) 10(11):165–170

    Google Scholar 

  36. Maragos P, Gros P, Katsamanis A, Papandreou G (2008) Cross-modal integration for performance improving in multimedia: a review. In: Maragos P, Potamianos A, Gros P (eds) Multimodal processing and interaction, vol 33., multimedia systems and applications, US, Springer, pp 1–46

  37. Mojahed A (2015) Heterogeneous data: data mining solutions. http://amojahed.wix.com/heterogeneous-data. Accessed 30 Aug 2015

  38. Mojahed A, Bettencourt-Silva J, Wang W, de la Iglesia B (2015) Applying clustering analysis to heterogeneous data using similarity matrix fusion (smf). In: Perner P (ed) Machine learning and data mining in pattern recognition, vol 9166 of lecture notes in computer science, pp 251–265. Springer International Publishing

  39. Mojahed A, De La Iglesia B (2014) A fusion approach to computing distance for heterogeneous data. In: Proceedings of the sixth international conference on knowledge discover and information retrieval (KDIR 2014), pp 269–276, Rome, Italy. SCITEPRESS

  40. Ng R, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th conference on VLDB, pp 144–155

  41. NICE (2014) Prostate cancer: diagnosis and treatment. NICE Clin Guidel 175:1–48

  42. Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175

    Article  MATH  Google Scholar 

  43. Park H-S, Jun C-H (2009) A simple and fast algorithm for k-medoids clustering. Expert Syst Appl 36(2):3336–3341

    Article  Google Scholar 

  44. Pavlidis P, Cai J, Weston J, Noble WS (2002) Learning gene functional classifications from multiple data types. J Comput Biol 9(2):401–411

    Article  Google Scholar 

  45. Rand WM (1958) Objective criteria foe the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850

    Article  Google Scholar 

  46. Ratanamahatana CA, Keogh E (2005) Three myths about dynamic time warping data mining. In: Proceedings of SIAM international conference on data mining (SDM05), pp 506–510

  47. Reuters T (2015a) ISI Web of Knowledge: Journal citation reports. http://wokinfo.com/products_tools/analytical/jcr/. Accessed 14 April 2015

  48. Reuters T (2015b) Web of Science. http://apps.webofknowledge.com/WOS_GeneralSearch_input.do?product=WOS&SID=P1JvWUMqY5wYpc8EIER&search_mode=GeneralSearch. Accessed 14 April 2015

  49. Salton G, McGill MJ (1987) Introduction to modern information retrieval. McGraw-Hill, New York

    MATH  Google Scholar 

  50. Shi Y, Falck T, Daemen A, Tranchevent L-C, Suykens JAK, De Moor B, Moreau Y (2010) L2-norm multiple kernel learning and its application to biomedical data fusion. BMC Bioinform 11:309–332

    Article  Google Scholar 

  51. Society TRH (2014) Plants. https://www.rhs.org.uk/

  52. Strehl A, Ghosh J (2003) Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617

    MathSciNet  MATH  Google Scholar 

  53. Žitnik M, Zupan B (2014) Matrix factorization-based data fusion for gene function prediction in baker’s yeast and slime mold. Syst Biomed 2:1–7

    Article  Google Scholar 

  54. van Vliet MH, Horlings HM, van de Vijver MJ, Reinders MJT, Wessels LFA (2012) Integration of clinical and gene expression data has a synergetic effect on predicting breast cancer outcome. PLoS One 7(7):e40358

    Article  Google Scholar 

  55. Wang J, Zeng H, Chen Z, Lu H, Tao L, Ma W (2003) Recom: reinforcement clustering of multi-type interrelated data objects. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval, pp 274–281

  56. Wikipedia (2015) Wikipedia: the free encyclopedia. https://en.wikipedia.org/wiki/Main_Page. Accessed 24 April 2015

  57. Yu S, Moor B, Moreau Y (2009) Clustering by heterogeneous data fusion: framework and applications. In: NIPS workshop

  58. Zeng H, Chen Z, Ma W (2002) a unified framework for clustering heterogeneous web objects. In: Proceedings of the 3rd international conference on web information systems engineering (WISE), pp 161–172

  59. Zha H, Ding C, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the 10th international conference on information and knowledge management, pp 25–32

Download references

Acknowledgments

We acknowledge support from Grant Number ES/L011859/1, from The Business and Local Government Data Research Centre, funded by the Economic and Social Research Council to provide economic, scientific and social researchers and business analysts with secure data services.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Aalaa Mojahed.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Mojahed, A., de la Iglesia, B. An adaptive version of k-medoids to deal with the uncertainty in clustering heterogeneous data using an intermediary fusion approach. Knowl Inf Syst 50, 27–52 (2017). https://doi.org/10.1007/s10115-016-0930-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-016-0930-3

Keywords

Navigation