A new incomplete pattern belief classification method with multiple estimations based on KNN

https://doi.org/10.1016/j.asoc.2020.106175

Abstract

The classification of missing data is a challenging task because the lack of pattern attributes may bring uncertainty to the classification results, and most classification methods produce only one estimation, which carries a risk of misclassification. A new incomplete pattern belief classification (PBC) method with multiple estimations based on K-nearest neighbors (KNNs) is proposed to deal with missing data. PBC preliminarily classifies the incomplete pattern using its KNNs obtained from the known attributes. A pattern whose KNNs carry only one class of information can be directly assigned to this class. Otherwise, p (p ≤ c) estimations are computed from the KNNs belonging to the different classes when p classes are represented among the KNNs of the pattern, and the chosen classifier yields p classification results. Then, a weighted possibility distance method is used to further discriminate the p classification results with the classification information of their KNNs. A pattern with similar possibility distances in different classes is reasonably assigned to a proper meta-class under the framework of belief functions theory, which truly reflects the uncertainty of the pattern caused by missing values and effectively reduces the error rate. Experiments on both artificial and real data sets show that PBC is effective for dealing with missing data.

Introduction

An incomplete pattern, also called missing data, is a very common phenomenon in practice. For example, 45% of the data sets in the UCI machine learning repository [1], which gathers real cases from various fields, contain missing values. The reasons for missing attributes are various: the sensors used to collect information may have failed, some questions related to personal privacy may be refused in a social survey, and some tests cannot be performed for every patient in the medical field [2], [3]. A number of methods [3], [4], [5] have emerged for classifying incomplete patterns, among which KNN technology and its derivatives have been widely used in many cases [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] due to their ease of implementation.

An early version of K-nearest neighbors imputation (KNNI) was proposed in [6], where the missing values in an incomplete pattern are replaced by the average of the corresponding attributes of its KNNs, and a popular weighted K-nearest neighbor imputation (WKNNI) method was introduced in [7] for DNA microarray data to reasonably model the (greater or lesser) influence of the different KNNs on the pattern. In particular, some works have been dedicated to integrating KNNI with other technologies [9], [10], [11]. For instance, an interesting adaptive imputation of missing values is presented in [9], using KNNs and the Self-Organizing Map (SOM) under belief functions theory [16], [17], [18] to capture the uncertainty that may be caused by missing values. In [11], an effective No Skip-KNN imputation (NS-kNNI) method is developed to solve the problem that the estimated values are always higher than the ground truth, using the KNNs with the minimum abundance over all patterns to replace the missing values. Recently, some studies have begun to focus on the impact of missing degrees and outliers on the classification accuracy of incomplete patterns [12], [13], [14]; for example, a novel purity-based K-nearest neighbors imputation (PKNNI) is applied to financial distress prediction [14] to improve the overall performance across different missing degrees and the robustness to outliers. Interestingly, a locally linear approximation (LLA) approach for incomplete data is proposed in [15], where the missing values are estimated using KNNs with optimal weights obtained by locally linear reconstruction. These imputation methods, especially those based on KNN technology, may still have the following problems (a minimal imputation sketch is given after the list below):

(1) The necessity of imputation technology may not be considered. Estimation (imputation) strategies may lead to a waste of resources and may even increase the risk of error. For a specific pattern, we may be able to determine its class directly when its KNNs obtained from the training data set all belong to one class. In such a case, making a decision based on the output of the classifier increases the risk of error, because the classifier is only globally optimal and may not be suitable for every individual pattern.

(2) The single estimation produced for the incomplete pattern by traditional KNN technology may be inaccurate. For a pattern with a particular class (even though its label is unknown), it may be unreasonable to use KNNs carrying information from multiple classes to obtain the estimation, and the selection of KNNs depends heavily on the completeness of the pattern: the higher the missing degree of the incomplete pattern, the more distorted the KNNs obtained.

(3) The imprecision that the estimation may introduce is not taken into account. A pattern may obtain different estimations if different imputation methods are adopted, which indicates that the estimations cannot replace the ground truth and will inevitably lead to imprecision.
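To make the KNNI/WKNNI family reviewed above concrete, the following is a minimal sketch of KNN-based imputation under simplifying assumptions: the neighbors are drawn from complete training patterns, distances use only the attributes observed in the query, and the weighted variant uses inverse-distance weights. The function name knn_impute and the toy data are hypothetical and not from the paper.

```python
# Minimal sketch of KNNI / WKNNI-style imputation (illustrative assumptions only).
import numpy as np

def knn_impute(query, train_X, k=9, weighted=False):
    """Fill NaN entries of `query` from its K nearest complete neighbors."""
    observed = ~np.isnan(query)                      # mask of known attributes
    # Euclidean distance computed on the observed attributes only
    dists = np.sqrt(((train_X[:, observed] - query[observed]) ** 2).sum(axis=1))
    nn_idx = np.argsort(dists)[:k]                   # indices of the K nearest neighbors
    neighbors = train_X[nn_idx]

    if weighted:                                     # WKNNI: closer neighbors count more
        w = 1.0 / (dists[nn_idx] + 1e-12)
        estimate = (w[:, None] * neighbors).sum(axis=0) / w.sum()
    else:                                            # KNNI: plain average of the neighbors
        estimate = neighbors.mean(axis=0)

    filled = query.copy()
    filled[~observed] = estimate[~observed]          # replace only the missing attributes
    return filled

# Toy usage: a 3-attribute pattern with one missing value
train_X = np.array([[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [5.0, 6.0, 7.0]])
query = np.array([1.05, np.nan, 3.0])
print(knn_impute(query, train_X, k=2, weighted=True))
```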

Therefore, a new multiple-estimation classification method based on KNN technology is proposed to address the above problems that may occur in the classification of incomplete patterns; it is inspired by some existing multiple imputation methods [4], [8], [19], [20], [21]. In the early idea of multiple imputation [19], missing values are imputed m times and m complete data sets are generated based on an appropriate model of random variation, but such a model is sometimes difficult to obtain. A multiple imputation strategy based on the extreme learning machine is proposed in [20], which performs well on general regression problems with missing data. The work in [8] studies the development of automated data imputation models and proposes a multiple imputation method based on the combination of a multilayer perceptron and KNNs. However, these multiple imputation methods do not consider the imprecision caused by missing values. An easily accessible version of multiple estimations is introduced in the prototype-based credal classification (PCC) method [4]. In PCC, the missing values of an incomplete pattern are replaced in turn by the centers of the different classes; each estimation of the pattern is classified by a standard classifier and the sub-classifications are globally fused, which characterizes the imprecision of classification due to the absence of some attributes and also reduces misclassification. However, estimations for the incomplete pattern based on class prototypes are not accurate enough. In addition, the above multiple imputation strategies do not consider the possible negative impact of the globally optimal classifier, which may not suit some specific patterns.
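To illustrate the prototype-based multiple-estimation idea of PCC mentioned above, the following rough sketch fills the missing values in turn with each class mean (prototype) and classifies every completed version with a standard classifier; the global fusion step of PCC [4] is deliberately omitted, and the function name class_prototype_estimations and the toy data are hypothetical.

```python
# Illustrative sketch of prototype-based multiple estimations (not the exact PCC algorithm).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def class_prototype_estimations(query, train_X, train_y):
    """Return one completed copy of `query` per class, with missing values taken
    from that class's mean vector (the class prototype)."""
    estimations = {}
    missing = np.isnan(query)
    for label in np.unique(train_y):
        prototype = train_X[train_y == label].mean(axis=0)
        filled = query.copy()
        filled[missing] = prototype[missing]
        estimations[label] = filled
    return estimations

# Toy usage: classify each estimated version with a standard classifier
train_X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.1]])
train_y = np.array([0, 0, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y)

query = np.array([np.nan, 2.0])
for label, est in class_prototype_estimations(query, train_X, train_y).items():
    print(label, clf.predict_proba(est.reshape(1, -1)))
```

In PCC itself, the per-class outputs would then be globally fused to characterize the imprecision caused by the missing attributes; here they are simply printed.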

In our recent work, a new pattern classification accuracy improvement (CIA) method [5] working with a local quality matrix was proposed to overcome the shortcomings of the globally optimal classifier. The classification result obtained by the global classifier is corrected by the quality matrix, which expresses the conditional probability that the pattern belongs to one class when it is classified into another, and which is estimated from the KNNs of the pattern. The experimental results show that correcting the global classifier result is beneficial for the accurate classification of the pattern, but it brings some computational burden. In this paper, we develop a simplified possibility distance version to overcome the possible negative effects of the basic classifier on a specific pattern using the classification results of its KNNs, and we use the meta-class introduced in belief functions theory [16], [18], [20] to characterize the imprecision and uncertainty caused by the lack of information.

Belief functions theory [16], [18], [20], also called Dempster–Shafer theory (DST) or evidential reasoning, has special advantages in expressing this kind of uncertain and imprecise information, and it is widely used in many fields including data classification [4], [5], [22], [23], data clustering [24], and decision making [25]. In belief functions theory, the frame of discernment Ω = {ω1, …, ωc} is extended to the power-set 2^Ω, which contains all subsets of Ω. A pattern can be assigned to three kinds of classes: a singleton (specific) class (e.g., ωi), a meta-class (e.g., ωi ∪ ωk), and the outlier (noise) class represented by the empty set ∅. Among them, a singleton class represents a definite class and is used to express exact information, while a meta-class is composed of multiple singleton classes and is used to express uncertain and imprecise information. A specific pattern xi may belong to the singleton classes ω1 and ω2 if it is assigned to the meta-class ω1 ∪ ω2; in other words, xi lies in the overlapping region of ω1 and ω2, where it is actually difficult to classify it into either singleton class. Conversely, forcibly classifying xi into ω1 or ω2 may lead to a risk of misclassification.
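For reference, the following block illustrates, in standard Dempster–Shafer notation, a two-class frame, its power-set, and the basic belief assignment constraint; the numerical masses are only an illustrative example and are not taken from the paper.

```latex
% Two-class frame of discernment, its power-set and a basic belief assignment (bba).
\Omega = \{\omega_1, \omega_2\}, \qquad
2^{\Omega} = \{\emptyset,\; \omega_1,\; \omega_2,\; \omega_1 \cup \omega_2\}

m : 2^{\Omega} \to [0, 1], \qquad \sum_{A \in 2^{\Omega}} m(A) = 1

% Illustrative example: m(\omega_1) = 0.6,\; m(\omega_2) = 0.1,\; m(\omega_1 \cup \omega_2) = 0.3.
% Mass on the meta-class \omega_1 \cup \omega_2 expresses that the pattern may lie in the
% overlapping region of \omega_1 and \omega_2; mass on \emptyset would represent the outlier class.
```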

In this paper, we develop a new incomplete pattern belief classification (PBC) method with multiple estimations based on KNN technology. The PBC method either assigns the incomplete pattern to a particular (specific) class without any estimation, or provides multiple possible estimations of the missing values according to the class information of its KNNs obtained from the known attributes. For a c-class problem, the KNNs of the incomplete pattern are obtained from the training patterns, and the pattern is directly classified into a class, rather than estimating the missing values, if only one class is represented among the KNNs; otherwise, p (p ≤ c) possible estimations are computed if the KNNs contain p classes. Each estimated pattern is classified by the trained standard classifier, so this yields p classification results, which may support one or multiple classes. Then, a weighted possibility distance method is used to further discriminate the p classification results. Specifically, the sum of the distances, named the possibility distance, between the classification result of each estimated pattern and its corresponding KNNs of the same class is used as a referee to determine the possible class of the pattern. There are thus p possibility distances in total, and the class with the smallest possibility distance is designated as the final class of the pattern. However, it can also happen that some of the p possibility distances show no significant difference, which indicates that the class of this pattern is quite imprecise (uncertain) given only the known information. In such a case, it is very difficult to correctly classify the pattern into a singleton (specific) class, and it becomes more prudent and reasonable to assign it to a meta-class (partially imprecise class), which not only reflects the uncertainty of the classification caused by missing values but also effectively reduces the error rate.
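The following is a highly simplified sketch of the decision flow just described, written only to fix ideas: the per-class estimation simply averages the neighbors of that class, the weighted possibility distance is replaced by a crude per-class score (the distance between the classifier output and the ideal output of that class), and the belief-functions machinery is reduced to returning a set of labels, i.e., a singleton class or, in case of near-ties, a meta-class. All names, the tie threshold meta_threshold, and the scoring rule are hypothetical and are not the paper's exact definitions.

```python
# Highly simplified sketch of the PBC decision flow (illustrative assumptions only).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def pbc_sketch(query, train_X, train_y, clf, k=9, meta_threshold=0.1):
    observed = ~np.isnan(query)
    dists = np.sqrt(((train_X[:, observed] - query[observed]) ** 2).sum(axis=1))
    nn_idx = np.argsort(dists)[:k]
    nn_labels = train_y[nn_idx]
    classes = np.unique(nn_labels)

    # Step 1: direct assignment when all K neighbors share one class (no imputation needed)
    if classes.size == 1:
        return {int(classes[0])}

    # Step 2: one estimation per class present in the KNNs, each classified separately
    scores = {}
    for label in classes:
        cls_neighbors = train_X[nn_idx][nn_labels == label]
        filled = query.copy()
        filled[~observed] = cls_neighbors.mean(axis=0)[~observed]
        proba = clf.predict_proba(filled.reshape(1, -1))[0]
        # Step 3: crude stand-in for the possibility distance of this class
        ideal = np.zeros_like(proba)
        ideal[list(clf.classes_).index(label)] = 1.0
        scores[int(label)] = np.linalg.norm(proba - ideal)

    # Step 4: smallest score wins; near-ties are kept together as a meta-class
    best = min(scores.values())
    return {lbl for lbl, s in scores.items() if s - best <= meta_threshold}

# Toy usage (hypothetical data): the first branch fires because all 3 neighbors share class 0
train_X = np.array([[1.0, 2.0], [1.2, 1.8], [1.1, 2.2], [5.0, 6.0], [5.2, 6.1], [4.9, 5.8]])
train_y = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y)
print(pbc_sketch(np.array([np.nan, 2.0]), train_X, train_y, clf, k=3))
```

In the actual method, the per-class score is the weighted possibility distance computed from the classification results of the KNNs, and the meta-class decision is made within belief functions theory rather than by a fixed threshold.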

The rest of this paper is organized as follows. Some methods used for comparison and basic knowledge about belief functions theory are introduced in Section 2, and the new PBC method is described in detail in Section 3. The proposed PBC method is then tested in Section 4 and compared with several other classical methods. Some related discussions are given in Section 5, followed by the conclusion.

Section snippets

Background knowledge

This section presents some related classical methods and basic knowledge.

Belief classification with multiple estimations

In this section, we propose a new incomplete pattern belief classification (PBC) method, which can avoid invalid imputation and reasonably characterize the uncertainty and imprecision caused by missing values. PBC first assesses the necessity of imputation for the query incomplete pattern by using its KNNs, i.e., PBC can directly assign the pattern to a specific class if its neighbors all come from the same class. Conversely, if multiple classes exist among the KNNs, PBC will provide

Experiment applications

Three experiments have been carried out to test and evaluate the performance of PBC with respect to MI [27], KNNI [6], FCMI [28], PCC [4], LLA [15] and GAIN [32]. K = 9 is the default in KNNI, LLA and PBC, and the other parameters of the different methods are kept at their default values. To verify that the implementation of PBC does not depend on the selection of the basic classifier, EK-NN, K-NN, SVM and ANN are employed as basic classifiers in the following experiments. Since we mainly focus on the classification of

Discussion

Here we discuss the computational complexity of PBC and its influencing factors. Let us assume that the incomplete patterns in the test set are classified using a training set with n patterns in the frame of discernment Ω = {ω1, ω2, …, ωc}. The computational complexity of PBC is dominated by the computation of distances between patterns to obtain the KNNs. In the preliminary classification of PBC, each test pattern needs to compute n distances to obtain its KNNs from the training set, so a

Conclusion

A new method, PBC, is proposed for classifying incomplete patterns. In PBC, the KNNs obtained from the known attributes of incomplete patterns are used for a preliminary classification, which effectively avoids the risk of erroneous imputation and the waste of resources. A cautious multiple imputation strategy is then applied according to the class information in the KNNs, which can well characterize the uncertainty of the incomplete pattern. Then, a weighted possibility distance method is proposed to finally determine the class of incomplete

CRediT authorship contribution statement

Zong-fang Ma: Software, Validation. Hong-peng Tian: Investigation, Formal analysis. Ze-chao Liu: Data curation. Zuo-wei Zhang: Conceptualization, Methodology, Writing - review & editing.

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2020.106175.

Acknowledgment

This work has been partially supported by Industrialization cultivation project of Shaanxi Provincial Department of Education (No. 18JC017).

References (41)

• Liu, Z.G., et al., Credal c-means clustering method based on belief functions, Knowl.-Based Syst. (2015)

• Wang, J.Q., et al., Intuitionistic fuzzy multi-criteria decision-making method based on evidential reasoning, Appl. Soft Comput. (2013)

• Ming, L.K., et al., Autonomous and deterministic supervised fuzzy clustering with data imputation capabilities, Appl. Soft Comput. (2011)

• Pelckmans, K., et al., Handling missing values in support vector machine classifiers, Neural Netw. (2005)

• Calvo-Zaragoza, J., et al., Improving kNN multi-label classification in prototype selection scenarios using class proposals, Pattern Recognit. (2015)

• Liu, Z.G., et al., A new belief-based K-nearest neighbor classification method, Pattern Recognit. (2013)

• Frank, A., et al., UCI Machine Learning Repository (2010)

• Gao, H., et al., A subspace ensemble framework for classification with high dimensional missing data, Multidimens. Syst. Signal Process. (2017)

• Liu, Z.G., et al., A new incomplete pattern classification method based on evidential reasoning, IEEE Trans. Cybern. (2015)

• Acuna, E., et al., The treatment of missing values and its effect on classifier accuracy, Classification, Clustering & Data Mining Applications (2004)