Elsevier

Pattern Recognition

Volume 44, Issue 3, March 2011, Pages 704-715
A class boundary preserving algorithm for data condensation

https://doi.org/10.1016/j.patcog.2010.08.014

Abstract

In instance-based machine learning, algorithms often suffer from storing large numbers of training instances. This results in large computer memory usage, long response time, and often oversensitivity to noise. In order to overcome such problems, various instance reduction algorithms have been developed to remove noisy and surplus instances. This paper discusses existing algorithms in the field of instance selection and abstraction, and introduces a new approach, the Class Boundary Preserving Algorithm (CBP), which is a multi-stage method for pruning the training set, based on a simple but very effective heuristic for instance removal. CBP is tested with a large number of datasets and comparatively evaluated against eight of the most successful instance-based condensation algorithms. Experiments showed that our algorithm achieved similar classification accuracies, with much improved storage reduction and competitive execution speeds.

Introduction

The Nearest Neighbor rule is one of the best-known instance-based machine learning algorithms for supervised non-parametric classification. It is widely used in machine learning because of its simplicity and the fact that its error probability is asymptotically bounded by twice the Bayes error rate. All instances of the training set are represented by position vectors in a multidimensional feature space, and the k-Nearest Neighbor rule (k-NN) classifies unseen vectors based on their closest k instances. In its simplest form, where k=1, the output value is simply the class of the nearest neighbor. Because k-NN retains all of the original instances, its storage requirements are high. This is a major concern in instance-based learning, as it leads to large computational complexity and long response times. In addition, because the entire training set is stored, noisy instances are retained as well, which often degrades classification accuracy.
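As an illustration of the rule described above (a minimal sketch, not code from the paper), the 1-NN classifier simply returns the label of the closest training vector under the Euclidean norm:

```python
import numpy as np

def one_nn_predict(X_train, y_train, x):
    """Classify x with the 1-NN rule: return the label of its
    single nearest training instance (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

# Toy 2-D training set with two classes
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

print(one_nn_predict(X_train, y_train, np.array([0.05, 0.1])))  # → 0
print(one_nn_predict(X_train, y_train, np.array([0.95, 1.0])))  # → 1
```

Note that the entire training set must be stored and scanned for every query, which is precisely the storage and response-time concern raised above.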

In order to tackle these problems, various reduction algorithms have been introduced that aim to prune the size of the training set while simultaneously keeping the error rate as low as possible. Data reduction methods can be categorised into instance selection algorithms, which select a small representative subset of the initial training set, and instance abstraction algorithms, which generate a new set of prototypes to replace the initial ones. Instance reduction algorithms also differ in the manner in which the search for prototypes is conducted. Selection algorithms are further subdivided into additive algorithms, which initialise an empty subset and proceed by inserting instances that satisfy certain criteria, and subtractive ones, which start with the full training set and search for instances to discard.
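To make the additive/subtractive distinction concrete, a classical additive selection algorithm is Hart's Condensed Nearest Neighbor (CNN): it seeds the subset with one instance and repeatedly inserts any instance that the current subset misclassifies under the 1-NN rule. The sketch below is an illustrative simplification, not the CBP method of this paper:

```python
import numpy as np

def cnn_condense(X, y):
    """Sketch of Hart's Condensed Nearest Neighbor (CNN), an
    *additive* selection algorithm: start from one seed instance
    and insert every instance the current subset misclassifies
    with the 1-NN rule, sweeping until no further change."""
    keep = [0]  # seed with the first instance
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            dists = np.linalg.norm(X[keep] - X[i], axis=1)
            if y[keep][np.argmin(dists)] != y[i]:
                keep.append(i)
                changed = True
    return np.array(keep)

# Two well-separated clusters: only one instance per class is kept
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(cnn_condense(X, y))  # → [0 3]
```

A subtractive algorithm would instead start with all six instances and test which ones can be discarded without changing the decisions.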

In this paper, a new method is proposed to combine selection and abstraction to obtain a new condensed set of instances. It aims to preserve instances close to the class boundaries, since they can provide most of the information needed to effectively describe the underlying distribution. On the other hand, most of the non-boundary instances are considered redundant, because they do not affect the decision surface. The relative positions of instances to their nearest enemies are considered in order to distinguish between border and non-border vectors, with different reduction procedures applied in each case.

The structure of this paper is as follows. In Section 2 a short review of previous work in the field of instance reduction is given. Section 3 introduces the proposed algorithm CBP and its five steps that perform cleaning, boundary identification and instance pruning. An evaluation of CBP is presented in Section 4 and its performance is compared to some of the most established reduction approaches. In Section 5 our work is summarised and conclusions are drawn on the effectiveness of our method.

Previous approaches

Many works have addressed the issue of condensing the size of the training set. While the early methods focused mainly on removing harmful instances, which are the ones that decrease the classification accuracy of the algorithm, more recent algorithms also tackle redundancy. Redundant vectors are instances that do not affect the precision of the algorithm and even when removed the algorithm’s competence is not disrupted.

The proposed algorithm

In this section, our proposed CBP algorithm is presented and a thorough analysis of the steps involved is provided. Given an initial set X = {x_1, ..., x_n} of n d-dimensional instances, where each sample x is associated with a unique class label ψ(x) ∈ L = {l_1, ..., l_c}, the problem in instance reduction is to determine a set of m prototypes (where m ≪ n) that best describes the underlying distribution. Internal instances positioned away from class boundaries have little or no effect on classification accuracy.
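The notion of a nearest enemy, used by CBP to separate border from internal instances, can be computed directly: for each instance, find the closest instance carrying a different class label. The sketch below shows only this building block under the definitions above (the full CBP procedure, with its multi-level reachable sets, is more elaborate):

```python
import numpy as np

def nearest_enemy_distance(X, y):
    """For each instance x_i, return the Euclidean distance to its
    nearest enemy: the closest instance with a different label.
    Instances with small nearest-enemy distances lie near a class
    boundary; large distances indicate internal instances."""
    n = len(X)
    d = np.full(n, np.inf)
    for i in range(n):
        for j in range(n):
            if y[i] != y[j]:
                d[i] = min(d[i], np.linalg.norm(X[i] - X[j]))
    return d

# Three collinear points: the rightmost one is deep inside class 1
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
y = np.array([0, 1, 1])
print(nearest_enemy_distance(X, y))  # → [1. 1. 3.]
```

Here the first two instances (distance 1 to an enemy) would be treated as border candidates, while the third (distance 3) is an internal instance of the kind the text deems removable.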

Experimental analysis

A difficulty in the comparative evaluation of instance reduction algorithms is that their overall performance is characterised not only by the classification accuracy they exhibit, but also by the condensation ratio they achieve. Thus, there is an underlying multi-objective optimisation problem in the design and training of such algorithms. These two objectives are conflicting, and an improvement in one often leads to a deterioration of the other. Consequently, there is a trade-off
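The two competing objectives can be quantified with simple ratios. The definitions below are one common convention, given for illustration (the paper's exact formulas may differ): condensation ratio as the fraction of the training set removed, and accuracy as the fraction of correctly classified test instances.

```python
def condensation_ratio(n_original, n_retained):
    """Fraction of the training set removed by a reduction
    algorithm (one common convention; definitions vary)."""
    return 1.0 - n_retained / n_original

def accuracy(y_true, y_pred):
    """Fraction of correctly classified test instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# e.g. retaining 120 of 1000 training instances removes 88%
print(condensation_ratio(1000, 120))  # → 0.88
print(accuracy([0, 1, 1], [0, 1, 0]))  # → 0.666...
```

An algorithm that retains almost everything scores high on accuracy but near zero on condensation, and vice versa, which is exactly the trade-off discussed above.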

Discussion

As already mentioned, instance reduction is a two-objective optimisation problem, and a gain in one objective is accompanied by a worsening of the other. Therefore, Table 2 presents both accuracy and condensation ratios for all competing algorithms and all datasets in our experiments. From the table, it is clear that ENN, HMN and LIF, which exhibit slightly higher average accuracies, are not very successful in optimising condensation. Overall, all algorithms manage to achieve similar accuracies (later on

Conclusions

We have proposed a novel instance reduction method, the CBP algorithm, which combines instance selection with a simple but powerful heuristic, together with the concepts of multi-level reachable sets and nearest enemy pairs, to determine the geometric structure of patterns around every sample and proceed with the removal of redundant instances. We examined its performance on nineteen datasets and compared the obtained results to eight instance reduction algorithms that we

Acknowledgements

This work was supported by a DTA studentship from the University of Liverpool, Department of Electrical Engineering & Electronics. We thank Eduardo Rodriguez-Martinez who provided valuable advice on the implementation, and Prof. Elena Marchiori who kindly provided us with her CCIS code used for our experimentation. We also thank the anonymous reviewers for their very useful comments and suggestions which improved the quality of the manuscript.


    Nikolaidis Konstandinos is a Ph.D. student at the Department of Electrical Engineering and Electronics of Liverpool University. He received his Bachelor Degree (Hons, Class I) in the same department in 2008. His research interests include machine learning and data mining, specifically instance based learning.

    John Yannis Goulermas obtained his B.Sc. (Hons, Class I) in Computation at UMIST in 1994. In 1996 and 2000 he obtained his M.Sc. by Research and Ph.D. in the Control Systems Centre, Manchester. He has been a member of the IEEE since 1997. He has worked in industry in the area of financial/pricing modelling and optimisation, and as a Senior Research Fellow in the areas of computer graphics, biomechanics and intelligent gait analysis. He was appointed as a Lecturer in the Department of EE&E of the University of Liverpool in January 2005. His main research interests include machine vision, machine learning and modelling/optimisation.

    Q.H. Wu received his M.Sc. degree in electrical engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1981 and Ph.D. degree in electrical engineering from The Queen’s University of Belfast (QUB), Belfast, U.K., in 1987. From 1981 to 1984, he was Lecturer in Electrical Engineering at HUST. He was a Research Fellow and Senior Research Fellow with QUB from 1987 to 1991 and Lecturer and Senior Lecturer in the Department of Mathematical Sciences, Loughborough University of Technology, U.K., from 1991 to 1995. Since 1995, he has been the Chair of Electrical Engineering in the Department of Electrical Engineering and Electronics, The University of Liverpool, acting as the Head of Intelligence Engineering and Automation Group. His research interests include adaptive control, mathematical morphology, neural networks, learning systems, evolutionary computation, and power system control and operation. Dr. Wu is a Chartered Engineer and Fellow of IEE.
