An adaptive version of k-medoids to deal with the uncertainty in clustering heterogeneous data using an intermediary fusion approach

Mojahed, Aalaa; de la Iglesia, Beatriz

doi:10.1007/s10115-016-0930-3

Regular Paper
Published: 18 March 2016

Volume 50, pages 27–52 (2017)

Knowledge and Information Systems

Abstract

This paper introduces Hk-medoids, a modified version of the standard k-medoids algorithm. The modification extends the algorithm to the problem of clustering complex heterogeneous objects that are described by a diversity of data types, e.g. text, images, structured data and time series. We first proposed an intermediary fusion approach, SMF, to calculate fused similarities between objects, taking into account the similarities between the component elements of the objects using appropriate similarity measures. The fused approach entails uncertainty for incomplete objects or for objects which have diverging distances according to the different components. The implementation of Hk-medoids proposed here works with the fused distances and deals with the uncertainty in the fusion process. We experimentally evaluate the potential of our proposed algorithm using five datasets with different combinations of data types that define the objects. Our results show the feasibility of our algorithm, and they also show a performance enhancement compared with the application of the original SMF approach in combination with a standard k-medoids that does not take uncertainty into account. In addition, from a theoretical point of view, our proposed algorithm has lower computational complexity than the popular PAM implementation.
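
The general idea of clustering on fused distances can be illustrated with a minimal sketch. It is not the authors' SMF or Hk-medoids: the abstract does not give their formulas, so the fusion rule here is assumed to be a weighted average of per-element distance matrices that skips missing (NaN) entries, the uncertainty proxy is assumed to be the share of elements missing for each pair, and the clustering step is a plain alternating k-medoids that ignores that proxy. The names fuse_distances, k_medoids and missing_share are hypothetical, and every pair of objects is assumed to have at least one known distance.

```python
import numpy as np


def fuse_distances(matrices, weights=None):
    """Fuse per-element distance matrices by a weighted average over known entries.

    Assumption for illustration only: missing distances are encoded as NaN.
    Returns the fused matrix and, per pair, the share of elements that were missing.
    """
    stack = np.stack([np.asarray(m, dtype=float) for m in matrices])  # (m, n, n)
    if weights is None:
        weights = np.ones(stack.shape[0])
    w = np.asarray(weights, dtype=float)[:, None, None]
    known = ~np.isnan(stack)
    w_known = np.where(known, w, 0.0)            # zero weight where a distance is missing
    total = w_known.sum(axis=0)
    numerator = (np.where(known, stack, 0.0) * w_known).sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        fused = numerator / total                # NaN where no element is known
    missing_share = 1.0 - known.mean(axis=0)     # crude uncertainty proxy, not the paper's
    return fused, missing_share


def k_medoids(dist, k, max_iter=100, seed=0):
    """Plain alternating k-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assignment step: each object joins its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # Update step: the member with the smallest total distance to the others.
                within = dist[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels


# Toy data: six objects, each described by two elements (made-up 1-D values).
a = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
b = np.array([0.0, 0.2, 0.1, 9.0, 9.1, 9.3])
d_a = np.abs(a[:, None] - a[None, :])
d_b = np.abs(b[:, None] - b[None, :])
d_b[0, 4] = d_b[4, 0] = np.nan                   # one distance is unavailable

fused, missing_share = fuse_distances([d_a, d_b])
medoids, labels = k_medoids(fused, k=2)
print(labels, missing_share[0, 4])
```

In this toy case the pair (0, 4) is fused from a single element, so its fused distance is less reliable than the others; an uncertainty-aware variant would use that kind of information during clustering, whereas the sketch above only reports it.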

Acknowledgments

We acknowledge support from Grant Number ES/L011859/1, from The Business and Local Government Data Research Centre, funded by the Economic and Social Research Council to provide economic, scientific and social researchers and business analysts with secure data services.

Author information

Authors and Affiliations

University of East Anglia, Norwich Research Park, Norwich, Norfolk, UK
Aalaa Mojahed & Beatriz de la Iglesia
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
Aalaa Mojahed

Authors

Aalaa Mojahed
Beatriz de la Iglesia

Corresponding author

Correspondence to Aalaa Mojahed.

About this article

Cite this article

Mojahed, A., de la Iglesia, B. An adaptive version of k-medoids to deal with the uncertainty in clustering heterogeneous data using an intermediary fusion approach. Knowl Inf Syst 50, 27–52 (2017). https://doi.org/10.1007/s10115-016-0930-3

Received: 07 October 2015
Revised: 01 January 2016
Accepted: 02 March 2016
Published: 18 March 2016
Issue Date: January 2017
DOI: https://doi.org/10.1007/s10115-016-0930-3
