Similarity classifier using similarity measure derived from Yu's norms in classification of medical data sets
Introduction
In this article, I study the suitability of a similarity measure derived from Yu's norms for use in a similarity classifier. Usually, a similarity classifier uses a similarity based on the Łukasiewicz structure with a generalized mean. The similarity classifier has proved to be a good method for classifying medical data sets [1]. I also test several known dimension reduction methods to see how they affect the classification results.
In this paper, I examine how well a similarity-based classifier suits the diagnosis of epidemiological data. I also compare the results with several classifiers that give good results with these data sets. The results presented in this paper are very promising.
The data sets were taken from the UCI Repository of Machine Learning Databases archive [2]. The data sets chosen were the diagnoses of erythemato-squamous diseases, diabetes in PIMA Indian women, lung cancer, two breast cancer data sets and a lymphography data set.
The differential diagnosis of erythemato-squamous diseases is a difficult problem in dermatology. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis and pityriasis rubra pilaris. They all share the clinical features of erythema and scaling, with very few differences [3]. The dermatology database examined in this study consisted of 358 cases of erythemato-squamous diseases compiled by Güvenir et al. [3]. By applying the new similarity model, I was able to achieve a significant improvement in accuracy compared with the stand-alone neural network model or the ANFIS model suggested in [4]. In the diagnosis of lung cancer, the purpose is to recognize three different pathological types of lung cancer [5]. In the lymphography domain, one attempts to distinguish normal findings, metastases, malignant lymphocytes and fibrosis from each other [6]. The diagnosis of diabetes in PIMA Indians is based on personal data and the results of medical examinations, which are used to decide whether an individual is diabetes positive or not. The data set contains 768 examples from the National Institute of Diabetes and Digestive and Kidney Diseases [7]. I also tested the classifier on two well-known breast cancer data sets originating from Wolberg and Mangasarian [8], [9], [10]. In these data sets, nuclear size, shape and texture features are used to distinguish benign from malignant breast cytology. I consider the accuracy achieved with these data also quite satisfactory.
The first classification results of the similarity classifier were published in [11]. Since then, results and new developments concerning this classifier have been a regular topic at FUZZ-IEEE conferences [11], [12], [13], [14]. In [11], the classifier used a similarity measure based on the generalized Łukasiewicz structure. In [12], the most common means were used in the similarity measure and the results were presented. The weights of the similarity measure can also be optimized, which was studied in more detail in [15]. It was observed that the differential evolution (DE) algorithm [16] is, in most cases, more appropriate for this classifier than the genetic algorithm that was originally used. The classifier was also tested with a similarity measure using a generalized mean [17], and these results, combined with data preprocessing and dimension reduction methods, were discussed in [1]. According to [1], the classifier also classifies medical data sets quite accurately. The stability of the classifier with respect to the ideal vector and weights was studied in [18]. The classifier is not limited to this similarity measure: in [19], results for measures based on probabilistic equivalences are presented.
In this article, classification results using a measure derived from Yu's norms [20] are presented. I have also tested two different preprocessing methods, PCA and entropy minimization, and their effects are discussed.
Data sets as diverse as possible were chosen so that the properties of the classifier would become apparent. The data sets were taken from the UCI Repository of Machine Learning Databases [2] and were selected so that their distributions and dimensions varied. The classifier was implemented with the -software.
Mathematical background
The general forms of the intersection and union are represented by triangular norms (T-norms) and triangular conorms (T-conorms or S-norms), respectively.
The T-norm is a two-place function from [0,1] × [0,1] to [0,1] satisfying the following criteria:
The conditions defining an S-norm (T-conorm), Sn: [0,1] × [0,1] to [0,1], are:
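For reference, the axioms usually required of both a T-norm and an S-norm are a boundary condition, monotonicity, commutativity and associativity. The sketch below illustrates them on the parametric family commonly attributed to Yu, assuming the forms T(a, b) = max(0, (1 + λ)(a + b − 1) − λab) and S(a, b) = min(1, a + b + λab) with λ > −1. The function names are hypothetical, and the exact parameterization should be checked against [20].

```python
def yu_tnorm(a: float, b: float, lam: float) -> float:
    """Yu's T-norm (assumed form), defined for lam > -1 and a, b in [0, 1]."""
    return max(0.0, (1.0 + lam) * (a + b - 1.0) - lam * a * b)


def yu_snorm(a: float, b: float, lam: float) -> float:
    """Yu's S-norm / T-conorm (assumed form), defined for lam > -1."""
    return min(1.0, a + b + lam * a * b)


if __name__ == "__main__":
    lam = 2.0
    # Boundary conditions: 1 is neutral for the T-norm, 0 for the S-norm.
    print(yu_tnorm(0.3, 1.0, lam), yu_snorm(0.3, 0.0, lam))
    # Commutativity holds because both formulas are symmetric in a and b.
    print(yu_tnorm(0.7, 0.6, lam) == yu_tnorm(0.6, 0.7, lam))
```

With λ = 0 the pair reduces to the Łukasiewicz T-norm max(0, a + b − 1) and T-conorm min(1, a + b), which is the structure the earlier versions of the classifier were built on.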
Dimension reduction methods
High-dimensional data sets present many mathematical challenges as well as some opportunities, and are bound to give rise to new theoretical developments [25]. One of the problems with high-dimensional data sets is that, in many cases, not all the measured variables are “important” for understanding the underlying phenomena of interest. In mathematical terms, the problem we investigate in dimension reduction can be stated as follows: given the r-dimensional random variable , find a
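As one concrete instance of dimension reduction, PCA (one of the preprocessing methods tested in this article) can be sketched as follows. This is a minimal illustration using NumPy's singular value decomposition, not the implementation used for the reported results; the function name `pca_reduce` and its parameters are assumptions.

```python
import numpy as np


def pca_reduce(X, k):
    """Project the rows of X onto the k directions of largest variance.

    Returns the reduced data and the fraction of variance explained by
    each kept component.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:k].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:k]


if __name__ == "__main__":
    # Four points on a line in 2-D: one component carries all the variance.
    scores, ratio = pca_reduce([[0, 0], [1, 1], [2, 2], [3, 3]], k=1)
    print(scores.shape, ratio)
```

The explained-variance ratios give a simple way to decide how many components to keep before passing the reduced data to a classifier.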
Classifier
The problem of classification is basically one of partitioning the feature space into regions, one region for each category. Ideally, one would like to arrange this partitioning so that none of the decisions is ever wrong [30].
We would like to classify a set X of objects into the N different classes by their features. We suppose that D is the number of different kinds of features that we can measure from objects. We suppose that the values for the magnitude of each feature are
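The general idea of a similarity classifier can be sketched as follows: one ideal vector is formed per class, and a sample is assigned to the class whose ideal vector it is most similar to. The sketch assumes features scaled to [0, 1], positive exponents p and m, and the Łukasiewicz-type similarity with a generalized mean mentioned in the introduction, S(x, v) = ((1/D) Σ (1 − |x_d^p − v_d^p|)^(m/p))^(1/m). It does not reproduce the measure derived from Yu's norms, and all function names are hypothetical.

```python
from collections import defaultdict


def generalized_mean(values, m):
    """Generalized (power) mean with exponent m > 0."""
    return (sum(v**m for v in values) / len(values)) ** (1.0 / m)


def fit_ideal_vectors(samples, labels, m=1.0):
    """Compute one ideal vector per class as a feature-wise generalized mean."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    return {
        y: [generalized_mean(column, m) for column in zip(*rows)]
        for y, rows in by_class.items()
    }


def similarity(x, v, p=1.0, m=1.0):
    """Lukasiewicz-type similarity of x and v, aggregated by a generalized mean."""
    terms = [(1.0 - abs(a**p - b**p)) ** (m / p) for a, b in zip(x, v)]
    return (sum(terms) / len(terms)) ** (1.0 / m)


def classify(x, ideals, p=1.0, m=1.0):
    """Return the class whose ideal vector is most similar to x."""
    return max(ideals, key=lambda y: similarity(x, ideals[y], p, m))


if __name__ == "__main__":
    samples = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
    labels = ["a", "a", "b", "b"]
    ideals = fit_ideal_vectors(samples, labels)
    print(classify([0.2, 0.2], ideals), classify([0.7, 0.9], ideals))
```

Swapping in a different similarity measure only requires replacing the `similarity` function, which is what makes it straightforward to compare measures within the same classifier.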
Results with the data sets
All six data sets were split in half; one half was used for training and the other half for testing the classifier. The training sets were randomly created 100 times for each fixed value. In Table 1, properties of the data sets can be seen: one can see how many classes there are, how many dimensions data sets have and how many samples there are in the data sets. Next, the data sets and classification results with the data sets are presented. Then the parameter value investigation is carried
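The evaluation protocol described here, a random half-and-half split repeated 100 times, can be sketched as below. This is an illustration only: the function name, the `train_and_predict` callback and the seed handling are assumptions, and any classifier can be plugged in through the callback.

```python
import random
from statistics import mean


def repeated_half_split_accuracy(samples, labels, train_and_predict,
                                 repeats=100, seed=0):
    """Average test accuracy over repeated random 50/50 train/test splits.

    train_and_predict(train_x, train_y, test_x) must return one predicted
    label per test sample.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        half = len(indices) // 2
        train_idx, test_idx = indices[:half], indices[half:]
        predictions = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        correct = sum(pred == labels[i]
                      for pred, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return mean(accuracies), accuracies


if __name__ == "__main__":
    xs = [[0.1], [0.2], [0.8], [0.9]]
    ys = [0, 0, 1, 1]

    # Toy stand-in classifier: thresholds the single feature, ignores training.
    def threshold_rule(train_x, train_y, test_x):
        return [int(x[0] > 0.5) for x in test_x]

    avg, runs = repeated_half_split_accuracy(xs, ys, threshold_rule)
    print(avg, len(runs))
```

Keeping the per-run accuracies alongside the mean makes it possible to report the variance over the 100 repetitions as well.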
Conclusion
The similarity measure derived from Yu's norms was used in the similarity classifier, which was tested with six different medical data sets. The results were better than those previously achieved with the similarity classifier when similarity based on the Łukasiewicz structure was used.
Another important finding was that when preprocessing methods were investigated with the similarity classifier using a measure based on Yu's norms, entropy minimization usually seemed to give the best results. In
Pasi Luukka graduated from Lappeenranta University of Technology in 1999. He received his M.S. degree in Information Technology in 1999 and his Doctor of Technology degree in applied mathematics in 2005. His areas of interest are fuzzy data analysis, classification, artificial intelligence and evolutionary algorithms. He currently works at the Laboratory of Applied Mathematics at Lappeenranta University of Technology.
References (35)
- et al., Similarity classifier with generalized mean applied to medical data, Comput. Biol. Med. (2006)
- et al., Automatic detection of erythemato-squamous diseases using adaptive neuro-fuzzy inference systems, Comput. Biol. Med. (2005)
- et al., Optimal discriminant plane for a small number of samples and design method of classifier on the plane, Pattern Recog. (1991)
- et al., Computer-derived nuclear features distinguish malignant from benign breast cytology, Human Pathology (1995)
- Fuzzy sets, Inf. Control (1965)
- Similarity Relations and Fuzzy Orderings, Inf. Sci. (1971)
- A model for single and multiple knowledge based networks, Artif. Intell. Med. (2003)
- et al., Comparative analysis of statistical pattern recognition methods in high dimensional settings, Pattern Recog. (1994)
- C. Blake, C. Merz, UCI repository of machine learning databases, University of California, Irvine, Department of...
- et al., Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals, Artif. Intell. Med. (1998)
- Assistant-86: a knowledge-elicitation tool for sophisticated users
- Cancer diagnosis via linear programming, SIAM News