A class boundary preserving algorithm for data condensation
Introduction
The Nearest Neighbor rule is one of the best-known instance-based machine learning algorithms for supervised non-parametric classification. It is widely used because of its simplicity and because its asymptotic error probability is bounded by twice the Bayes error rate. All instances of the training set are represented by position vectors in a multidimensional feature space, and the k-Nearest Neighbor rule (k-NN) classifies unseen vectors according to their k closest instances. In its simplest form, where k = 1, the output is simply the class of the nearest neighbor. Because k-NN retains all of the original instances, its storage requirements are high; this is a major concern in instance-based learning, as it leads to large computational complexity and long response times. In addition, because the entire training set is stored, noisy instances are retained as well, which often degrades classification accuracy.
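For concreteness, the following minimal sketch (in Python, with illustrative function and variable names that are not taken from the paper) shows the 1-NN rule; the need to keep the full training matrix in memory makes the storage cost explicit.

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_query):
    """Assign to each query vector the label of its single nearest
    training instance (the 1-NN rule)."""
    predictions = []
    for q in X_query:
        # Euclidean distance from the query to every stored instance
        distances = np.linalg.norm(X_train - q, axis=1)
        predictions.append(y_train[np.argmin(distances)])
    return np.array(predictions)
```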
In order to tackle these problems, various reduction algorithms have been introduced that aim to prune the size of the training set while keeping the error rate as low as possible. Data reduction methods can be categorised into instance selection algorithms, which select a small representative subset of the initial training set, and instance abstraction algorithms, which generate a new set of prototypes to replace the initial ones. Instance reduction algorithms also differ in the manner in which the search for prototypes is conducted: additive algorithms initialise an empty subset and proceed by inserting instances that satisfy certain criteria, whereas subtractive ones start with the full training set and search for instances to discard.
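As an illustration of the additive family, the condensed nearest neighbor (CNN) rule of Hart (1968), listed in the references below, grows a subset from scratch and inserts every instance that the current subset misclassifies. The sketch below is a simplified single-pass version of that idea and is not part of the method proposed in this paper.

```python
import numpy as np

def cnn_condense(X, y):
    """Single-pass sketch of an additive (CNN-style) condensation step:
    start from one stored instance and add any instance that the current
    condensed subset misclassifies under the 1-NN rule."""
    keep = [0]  # initialise the condensed subset with the first instance
    for i in range(1, len(X)):
        distances = np.linalg.norm(X[keep] - X[i], axis=1)
        nearest = keep[int(np.argmin(distances))]
        if y[nearest] != y[i]:   # misclassified by the current subset,
            keep.append(i)       # so the instance must be retained
    return X[keep], y[keep]
```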
In this paper, a new method is proposed to combine selection and abstraction to obtain a new condensed set of instances. It aims to preserve instances close to the class boundaries, since they can provide most of the information needed to effectively describe the underlying distribution. On the other hand, most of the non-boundary instances are considered redundant, because they do not affect the decision surface. The relative positions of instances to their nearest enemies are considered in order to distinguish between border and non-border vectors, with different reduction procedures applied in each case.
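The nearest enemy of an instance is its closest neighbor belonging to a different class. As a rough illustration of how border vectors can be singled out (the threshold rule and the parameter alpha below are assumptions made for the sketch, not the actual CBP criteria), an instance may be flagged as a border candidate when its nearest enemy lies within a small multiple of the distance to its nearest same-class neighbor.

```python
import numpy as np

def flag_border_instances(X, y, alpha=1.5):
    """Flag instances whose nearest enemy (closest opposite-class instance)
    is within alpha times the distance to their nearest same-class
    neighbor; such instances lie close to a class boundary."""
    n = len(X)
    border = np.zeros(n, dtype=bool)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the instance itself
        enemy_dist = d[y != y[i]].min()    # distance to the nearest enemy
        friend_dist = d[y == y[i]].min()   # distance to the nearest friend
        border[i] = enemy_dist < alpha * friend_dist
    return border
```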
The structure of this paper is as follows. In Section 2, a short review of previous work in the field of instance reduction is given. Section 3 introduces the proposed algorithm, CBP, and its five steps, which perform cleaning, boundary identification and instance pruning. An evaluation of CBP is presented in Section 4, where its performance is compared to some of the most established reduction approaches. In Section 5 our work is summarised and conclusions are drawn on the effectiveness of our method.
Section snippets
Previous approaches
Many works have addressed the issue of condensing the size of the training set. While the early methods focused mainly on removing harmful instances, i.e. those that decrease the classification accuracy of the algorithm, more recent algorithms also tackle redundancy. Redundant vectors are instances that do not affect the precision of the algorithm; even when they are removed, the algorithm's competence is not disrupted.
The proposed algorithm
In this section, our proposed CBP algorithm is presented and a thorough analysis of the steps involved is provided. Given an initial set of n d-dimensional instances, where each sample is associated with a unique class label, the problem in instance reduction is to determine a set of m prototypes (where m ≪ n) that best describes the underlying distribution. Internal instances positioned away from class boundaries have little or no effect on classification accuracy.
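One common way to exploit this observation, shown here purely as an illustration and not as the actual CBP procedure, is to keep the border instances unchanged (selection) and to replace each class's internal instances with a few representative prototypes (abstraction), e.g. class-wise k-means centroids. The boolean border mask could come from a rule such as the earlier sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def abstract_internal_instances(X, y, border, prototypes_per_class=3):
    """Keep border instances as they are and replace each class's internal
    (non-border) instances with a few k-means centroids. Illustrative
    abstraction step only; it is not the CBP algorithm itself."""
    X_out, y_out = [X[border]], [y[border]]
    for label in np.unique(y):
        internal = X[(~border) & (y == label)]
        if len(internal) == 0:
            continue
        k = min(prototypes_per_class, len(internal))
        centroids = KMeans(n_clusters=k, n_init=10).fit(internal).cluster_centers_
        X_out.append(centroids)
        y_out.append(np.full(k, label))
    return np.vstack(X_out), np.concatenate(y_out)
```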
Experimental analysis
The problem with the comparative evaluation of instance reduction algorithms is that their overall performance is not characterised only by the classification accuracy they exhibit, but also by the condensation ratio they achieve. Thus, there is an underlying multi-objective optimisation problem in the design and training procedure of such algorithms. These two objectives are conflicting, and an improvement in one often leads to the deterioration of the other. Consequently, there is a trade-off
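Both objectives are straightforward to quantify: the classification accuracy of a classifier built on the condensed set, and the condensation ratio, taken here (as an assumption; exact definitions vary between papers) as the fraction of the original training instances that were discarded. A minimal sketch:

```python
def condensation_ratio(n_original, n_condensed):
    """Fraction of the original training instances that were discarded."""
    return 1.0 - n_condensed / n_original

def accuracy(y_true, y_pred):
    """Proportion of correctly classified test instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Example: keeping 120 out of 1000 instances gives a condensation ratio of 0.88.
print(condensation_ratio(1000, 120))
```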
Discussion
As already mentioned, instance reduction is a two-objective optimisation problem, and a gain in one objective is accompanied by a worsening of the other. Therefore, Table 2 presents both accuracy and condensation ratios for all competing algorithms and all datasets used in the experiments. From the table, it is clear that ENN, HMN and LIF, which exhibit slightly higher average accuracies, are not very successful in optimising condensation. Overall, all algorithms manage to have similar accuracies (later on
Conclusions
We have proposed a novel instance reduction method, the CBP algorithm, which combines instance selection with a simple but powerful heuristic based on the concepts of multi-level reachable sets and nearest enemy pairs, in order to determine the geometric structure of the patterns around every sample and to remove redundant instances. We examined its performance on nineteen datasets and compared the obtained results to eight instance reduction algorithms that we
Acknowledgements
This work was supported by a DTA studentship from the University of Liverpool, Department of Electrical Engineering & Electronics. We thank Eduardo Rodriguez-Martinez who provided valuable advice on the implementation, and Prof. Elena Marchiori who kindly provided us with her CCIS code used for our experimentation. We also thank the anonymous reviewers for their very useful comments and suggestions which improved the quality of the manuscript.
References (31)
- Prototype selection for dissimilarity-based classifiers, Pattern Recognition (2006)
- Learning good prototypes for classification using filtering and abstraction of instances, Pattern Recognition (2002)
- Enhancing prototype reduction schemes with LVQ3-type algorithms, Pattern Recognition (2003)
- Mean shift-based clustering, Pattern Recognition (2007)
- Asymptotic properties of nearest neighbor rules using edited data, IEEE Trans. Systems Man Cybernet. (1972)
- The condensed nearest neighbor rule, IEEE Trans. Inf. Theory (1968)
- The reduced nearest neighbor rule, IEEE Trans. Inf. Theory (1972)
- Two modifications of CNN, IEEE Trans. Systems Man Cybernet. (1976)
- Fast nearest neighbor condensation for large data sets classification, IEEE Trans. Knowledge Data Eng. (2007)
- An algorithm for a selective nearest neighbor decision rule, IEEE Trans. Inf. Theory (1975)
- Instance-based learning algorithms, Mach. Learning
- Hit miss networks with applications to instance selection, Journal of Machine Learning Research
- Class conditional nearest neighbor for large margin instance selection, IEEE Trans. Pattern Anal. Mach. Intell.
Cited by (70)
- Evidential instance selection for K-nearest neighbor classification of big data, International Journal of Approximate Reasoning, 2021.
  Citation excerpt: "In this study, we focused on non-evolutionary based methods because previous studies [7,8,12] have demonstrated their effectiveness. In general, non-evolutionary IS algorithms can be divided into the following three families based on the different methods used to select instances: edition [7], condensation [8–11], and hybrid [12,13]. In particular, the hybrid methods focus on the instances located around the class boundaries but they are also robust to noise objects, and thus they have been widely investigated."
- Semantic-k-NN algorithm: An enhanced version of traditional k-NN algorithm, Expert Systems with Applications, 2020.
- Natural neighborhood graph-based instance reduction algorithm without parameters, Applied Soft Computing Journal, 2018.
  Citation excerpt: "Each algorithm must be executed 10 times, and the average results are recorded in Tables 2 and 3. 1-Nearest Neighbor classifier is used to measure the classification accuracy for reasons of simplicity and is frequently used in previous works [11,23]. For distance measurements in all algorithms we use the Euclidean distance norm."
- Data preprocessing in predictive data mining, Knowledge Engineering Review, 2019.
- A heuristic hybrid instance reduction approach based on adaptive relative distance and k-means clustering, Journal of Supercomputing, 2024.
Nikolaidis Konstandinos is a Ph.D. student at the Department of Electrical Engineering and Electronics of Liverpool University. He received his Bachelor Degree (Hons, Class I) in the same department in 2008. His research interests include machine learning and data mining, specifically instance based learning.
John Yannis Goulermas obtained his B.Sc. (Hons, Class I) in Computation at UMIST in 1994. In 1996 and 2000 he obtained his M.Sc. by Research and Ph.D. in the Control Systems Centre, Manchester. He has been a member of the IEEE since 1997. He has worked in industry in the area of financial/pricing modelling and optimisation, and as a Senior Research Fellow in the areas of computer graphics, biomechanics and intelligent gait analysis. He was appointed as a Lecturer in the Department of EE&E of the University of Liverpool in January 2005. His main research interests include machine vision, machine learning and modelling/optimisation.
Q.H. Wu received his M.Sc. degree in electrical engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1981 and Ph.D. degree in electrical engineering from The Queen’s University of Belfast (QUB), Belfast, U.K., in 1987. From 1981 to 1984, he was Lecturer in Electrical Engineering at HUST. He was a Research Fellow and Senior Research Fellow with QUB from 1987 to 1991 and Lecturer and Senior Lecturer in the Department of Mathematical Sciences, Loughborough University of Technology, U.K., from 1991 to 1995. Since 1995, he has been the Chair of Electrical Engineering in the Department of Electrical Engineering and Electronics, The University of Liverpool, acting as the Head of Intelligence Engineering and Automation Group. His research interests include adaptive control, mathematical morphology, neural networks, learning systems, evolutionary computation, and power system control and operation. Dr. Wu is a Chartered Engineer and Fellow of IEE.