Abstract
This paper introduces Hk-medoids, a modified version of the standard k-medoids algorithm. The modification extends the algorithm to the problem of clustering complex heterogeneous objects that are described by a diversity of data types, e.g. text, images, structured data and time series. We first proposed an intermediary fusion approach, SMF, to calculate fused similarities between objects, taking into account the similarities between the component elements of the objects using appropriate similarity measures. The fused approach entails uncertainty for incomplete objects and for objects whose distances diverge across the different components. The implementation of Hk-medoids proposed here works with the fused distances and deals with the uncertainty in the fusion process. We experimentally evaluate the potential of our proposed algorithm using five datasets with different combinations of data types that define the objects. Our results show the feasibility of our algorithm, and they also show an improvement in performance compared to the original SMF approach combined with a standard k-medoids that does not take uncertainty into account. In addition, from a theoretical point of view, our proposed algorithm has lower computational complexity than the popular PAM implementation.
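The pipeline the abstract describes (per-type distances, fusion, then medoid-based clustering) can be illustrated with a minimal sketch. This is not the authors' Hk-medoids or SMF implementation: the abstract does not give formulas, so the fusion rule (a weighted mean over the components that are present), the two uncertainty indicators (fraction of missing components and spread between components) and the function names fuse_distances and k_medoids are all assumptions made for illustration. The clustering step is a plain alternating k-medoids on a precomputed matrix and does not use the uncertainty values.

```python
import numpy as np

def fuse_distances(dms, weights=None):
    """Fuse per-type distance matrices (assumed rule: weighted mean of present values).

    dms: list of (n, n) arrays, one per data type; NaN marks a comparison that
    could not be computed (e.g. the object lacks that component).
    Returns (fused, missing_frac, disagreement), each of shape (n, n).
    """
    stack = np.stack(dms).astype(float)                 # shape (m, n, n)
    m = stack.shape[0]
    w = np.ones(m) if weights is None else np.asarray(weights, dtype=float)
    present = ~np.isnan(stack)
    w3 = w[:, None, None] * present                     # zero weight where missing
    total = w3.sum(axis=0)
    safe_total = np.where(total > 0, total, 1.0)        # avoid division by zero
    fused = (np.where(present, stack, 0.0) * w3).sum(axis=0) / safe_total
    fused = np.where(total > 0, fused, np.nan)          # no component available
    # Assumed uncertainty indicators (hypothetical, not taken from the paper):
    missing_frac = 1.0 - present.mean(axis=0)           # share of missing components
    sq_dev = np.where(present, (stack - fused) ** 2, 0.0)
    disagreement = np.sqrt((w3 * sq_dev).sum(axis=0) / safe_total)
    return fused, missing_frac, disagreement

def k_medoids(dist, k, max_iter=100, seed=0):
    """Plain alternating k-medoids on a precomputed distance matrix without NaN."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)      # random initial medoids
    for _ in range(max_iter):
        # Assignment step: each object joins its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # Update step: the member with the smallest total distance
                # to the other members becomes the cluster's medoid.
                within = dist[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels
```

Calling fuse_distances on a list of per-type matrices and passing the fused matrix to k_medoids gives a baseline of the kind the abstract compares against, i.e. fusion followed by a clustering step that ignores uncertainty. How the uncertainty should then influence medoid selection is left open here, since the abstract does not specify it.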
Acknowledgments
We acknowledge support from Grant Number ES/L011859/1, from The Business and Local Government Data Research Centre, funded by the Economic and Social Research Council to provide economic, scientific and social researchers and business analysts with secure data services.
Cite this article
Mojahed, A., de la Iglesia, B. An adaptive version of k-medoids to deal with the uncertainty in clustering heterogeneous data using an intermediary fusion approach. Knowl Inf Syst 50, 27–52 (2017). https://doi.org/10.1007/s10115-016-0930-3
DOI: https://doi.org/10.1007/s10115-016-0930-3