A new incomplete pattern belief classification method with multiple estimations based on KNN

https://doi.org/10.1016/j.asoc.2020.106175

Abstract

The classification of missing data is a challenging task because the lack of pattern attributes may bring uncertainty to the classification results, and most classification methods produce only one estimation, which carries a risk of misclassification. A new incomplete pattern belief classification (PBC) method with multiple estimations based on K-nearest neighbors (KNNs) is proposed to deal with missing data. PBC preliminarily classifies the incomplete pattern using its KNNs obtained from the known attributes. A pattern whose KNNs carry only one class of information can be directly assigned to this class. Otherwise, p (p ≤ c) estimations are computed from the KNNs belonging to the different classes when p classes are represented among the KNNs of the pattern, and the chosen classifier yields p classification results. Then, a weighted possibility distance method is used to further discriminate the p classification results with the classification information of their KNNs. A pattern with similar possibility distances in different classes is reasonably assigned to a proper meta-class under the framework of belief functions theory, which truly reflects the uncertainty of the pattern caused by missing values and effectively reduces the error rate. Experiments on both artificial and real data sets show that PBC is effective for dealing with missing data.

Introduction

An incomplete pattern, also called missing data, is a very common phenomenon in practice. For example, 45% of the data sets in the UCI machine learning repository [1], which gathers real cases from various fields, contain missing values. The reasons for missing attributes are various: the sensors used to collect information may have failed, some questions related to personal privacy may be refused in a social survey, and some tests cannot be performed for every patient in the medical field [2], [3]. A number of methods [3], [4], [5] have emerged for classifying incomplete patterns, among which KNN technology and its derivatives have been widely used in many cases [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] due to their ease of implementation.

An early version of K-nearest neighbors imputation (KNNI) was proposed in [6], where the missing values in an incomplete pattern are replaced by the average of the corresponding attributes of its KNNs, and a popular weighted K-nearest neighbor imputation (WKNNI) method was introduced in [7] for DNA microarray data to reasonably model the (greater or lesser) influence of the different KNNs on the pattern. In particular, some works have been dedicated to integrating KNNI with other technologies [9], [10], [11]. For instance, an interesting adaptive imputation of missing values is presented in [9], using KNNs and the Self-Organizing Map (SOM) under belief functions theory [16], [17], [18] to capture the uncertainty that may be caused by missing values. In [11], an effective No Skip-KNN imputation (NS-kNNI) method is developed to solve the problem that the estimated values are always higher than the ground truth, using the KNNs with the minimum abundance over all patterns to replace the missing values. Recently, some studies have begun to focus on the impact of missing degrees and outliers on the classification accuracy of incomplete patterns [12], [13], [14]; for example, a novel purity-based K-nearest neighbors imputation (PKNNI) is applied to financial distress prediction [14] to improve the overall performance across different missing degrees and the robustness to outliers. Interestingly, a locally linear approximation (LLA) approach for incomplete data is proposed in [15], where the missing values are estimated using KNNs with optimal weights obtained by locally linear reconstruction. These imputation methods, especially those based on KNN technology, may still have the following problems (a minimal imputation sketch is given after the list below):

(1) The necessity of imputation technology may not be considered. Estimation (imputation) strategies may lead to a waste of resources and may even increase the risk of error. For a specific pattern, we may be able to determine its class directly when its KNNs obtained from the training data set all belong to one class. In such a case, making a decision based on the output of the classifier increases the risk of error, because the classifier is only globally optimal and may not be suitable for every individual pattern.

(2) The single estimation produced for the incomplete pattern by traditional KNN technology may be inaccurate. For a pattern with a particular class (even though its label is unknown), it may be unreasonable to use KNNs carrying information from multiple classes to obtain the estimation, and the selection of KNNs depends heavily on the completeness of the pattern: the higher the missing degree of the incomplete pattern, the more distorted the KNNs obtained.

(3) The imprecision that the estimation may introduce is not taken into account. A pattern may obtain different estimations if different imputation methods are adopted, which indicates that the estimations cannot replace the ground truth and will inevitably lead to imprecision.
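To make the KNNI/WKNNI family reviewed above concrete, the following is a minimal sketch of KNN-based imputation under simplifying assumptions: the neighbors are drawn from complete training patterns, distances use only the attributes observed in the query, and the weighted variant uses inverse-distance weights. The function name knn_impute and the toy data are hypothetical and not from the paper.

```python
# Minimal sketch of KNNI / WKNNI-style imputation (illustrative assumptions only).
import numpy as np

def knn_impute(query, train_X, k=9, weighted=False):
    """Fill NaN entries of `query` from its K nearest complete neighbors."""
    observed = ~np.isnan(query)                      # mask of known attributes
    # Euclidean distance computed on the observed attributes only
    dists = np.sqrt(((train_X[:, observed] - query[observed]) ** 2).sum(axis=1))
    nn_idx = np.argsort(dists)[:k]                   # indices of the K nearest neighbors
    neighbors = train_X[nn_idx]

    if weighted:                                     # WKNNI: closer neighbors count more
        w = 1.0 / (dists[nn_idx] + 1e-12)
        estimate = (w[:, None] * neighbors).sum(axis=0) / w.sum()
    else:                                            # KNNI: plain average of the neighbors
        estimate = neighbors.mean(axis=0)

    filled = query.copy()
    filled[~observed] = estimate[~observed]          # replace only the missing attributes
    return filled

# Toy usage: a 3-attribute pattern with one missing value
train_X = np.array([[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [5.0, 6.0, 7.0]])
query = np.array([1.05, np.nan, 3.0])
print(knn_impute(query, train_X, k=2, weighted=True))
```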

Therefore, a new multiple-estimation classification method based on KNN technology is proposed to address the above problems that may occur in the classification of incomplete patterns; it is inspired by some existing multiple imputation methods [4], [8], [19], [20], [21]. In the early idea of multiple imputation [19], missing values are imputed m times and m complete data sets are generated based on an appropriate model of random variation, but such a model is sometimes difficult to obtain. A multiple imputation strategy based on the extreme learning machine is proposed in [20], which performs well on general regression problems with missing data. The work in [8] studies the development of automated data imputation models and proposes a multiple imputation method based on the combination of a multilayer perceptron and KNNs. However, these multiple imputation methods do not consider the imprecision caused by missing values. An easily accessible version of multiple estimations is introduced in the prototype-based credal classification (PCC) method [4]. In PCC, the missing values of an incomplete pattern are replaced in turn by the centers of the different classes; each estimation of the pattern is classified by a standard classifier and the sub-classifications are globally fused, which characterizes the imprecision of classification due to the absence of some attributes and also reduces misclassification. However, estimations for the incomplete pattern based on class prototypes are not accurate enough. In addition, the above multiple imputation strategies do not consider the possible negative impact of the globally optimal classifier, which may not suit some specific patterns.
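To illustrate the prototype-based multiple-estimation idea of PCC mentioned above, the following rough sketch fills the missing values in turn with each class mean (prototype) and classifies every completed version with a standard classifier; the global fusion step of PCC [4] is deliberately omitted, and the function name class_prototype_estimations and the toy data are hypothetical.

```python
# Illustrative sketch of prototype-based multiple estimations (not the exact PCC algorithm).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def class_prototype_estimations(query, train_X, train_y):
    """Return one completed copy of `query` per class, with missing values taken
    from that class's mean vector (the class prototype)."""
    estimations = {}
    missing = np.isnan(query)
    for label in np.unique(train_y):
        prototype = train_X[train_y == label].mean(axis=0)
        filled = query.copy()
        filled[missing] = prototype[missing]
        estimations[label] = filled
    return estimations

# Toy usage: classify each estimated version with a standard classifier
train_X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.1]])
train_y = np.array([0, 0, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y)

query = np.array([np.nan, 2.0])
for label, est in class_prototype_estimations(query, train_X, train_y).items():
    print(label, clf.predict_proba(est.reshape(1, -1)))
```

In PCC itself, the per-class outputs would then be globally fused to characterize the imprecision caused by the missing attributes; here they are simply printed.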

In our recent work, a new pattern classification accuracy improvement (CIA) method [5] working with a local quality matrix was proposed to overcome the shortcomings of the globally optimal classifier. The classification result obtained by the global classifier is corrected by the quality matrix, which expresses the conditional probability that the pattern belongs to one class when it is classified into another, and which is estimated from the KNNs of the pattern. The experimental results show that correcting the global classifier result is beneficial for the accurate classification of the pattern, but it brings some computational burden. In this paper, we develop a simplified possibility distance version to overcome the possible negative effects of the basic classifier on a specific pattern using the classification results of its KNNs, and we use the meta-class introduced in belief functions theory [16], [18], [20] to characterize the imprecision and uncertainty caused by the lack of information.

Belief functions theory [16], [18], [20], also called Dempster–Shafer theory (DST) or evidential reasoning, has special advantages in expressing this kind of uncertain and imprecise information, and it is widely used in many fields including data classification [4], [5], [22], [23], data clustering [24], and decision making [25]. In belief functions theory, the frame of discernment Ω = {ω1, …, ωc} is extended to the power-set 2^Ω, which contains all subsets of Ω. A pattern can be assigned to three kinds of classes: a singleton (specific) class (e.g., ωi), a meta-class (e.g., ωi ∪ ωk), and the outlier (noise) class represented by the empty set ∅. Among them, a singleton class represents a definite class and is used to express exact information, while a meta-class is composed of multiple singleton classes and is used to express uncertain and imprecise information. A specific pattern xi may belong to the singleton classes ω1 and ω2 if it is assigned to the meta-class ω1 ∪ ω2; in other words, xi lies in the overlapping region of ω1 and ω2, where it is actually difficult to classify it into either singleton class. Conversely, forcibly classifying xi into ω1 or ω2 may lead to a risk of misclassification.
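For reference, the following block illustrates, in standard Dempster–Shafer notation, a two-class frame, its power-set, and the basic belief assignment constraint; the numerical masses are only an illustrative example and are not taken from the paper.

```latex
% Two-class frame of discernment, its power-set and a basic belief assignment (bba).
\Omega = \{\omega_1, \omega_2\}, \qquad
2^{\Omega} = \{\emptyset,\; \omega_1,\; \omega_2,\; \omega_1 \cup \omega_2\}

m : 2^{\Omega} \to [0, 1], \qquad \sum_{A \in 2^{\Omega}} m(A) = 1

% Illustrative example: m(\omega_1) = 0.6,\; m(\omega_2) = 0.1,\; m(\omega_1 \cup \omega_2) = 0.3.
% Mass on the meta-class \omega_1 \cup \omega_2 expresses that the pattern may lie in the
% overlapping region of \omega_1 and \omega_2; mass on \emptyset would represent the outlier class.
```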

In this paper, we develop a new incomplete pattern belief classification (PBC) method with multiple estimations based on KNN technology. The PBC method either assigns the incomplete pattern to a particular (specific) class without any estimation, or provides multiple possible estimations of the missing values according to the class information of its KNNs obtained from the known attributes. For a c-class problem, the KNNs of the incomplete pattern are obtained from the training patterns, and the pattern is directly classified into a class, rather than estimating the missing values, if only one class is represented among the KNNs; otherwise, p (p ≤ c) possible estimations are computed if the KNNs contain p classes. Each estimated pattern is classified by the trained standard classifier, so this yields p classification results, which may support one or multiple classes. Then, a weighted possibility distance method is used to further discriminate the p classification results. Specifically, the sum of the distances, named the possibility distance, between the classification result of each estimated pattern and its corresponding KNNs of the same class is used as a referee to determine the possible class of the pattern. There are thus p possibility distances in total, and the class with the smallest possibility distance is designated as the final class of the pattern. However, it can also happen that some of the p possibility distances show no significant difference, which indicates that the class of this pattern is quite imprecise (uncertain) given only the known information. In such a case, it is very difficult to correctly classify the pattern into a singleton (specific) class, and it becomes more prudent and reasonable to assign it to a meta-class (partially imprecise class), which not only reflects the uncertainty of the classification caused by missing values but also effectively reduces the error rate.
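The following is a highly simplified sketch of the decision flow just described, written only to fix ideas: the per-class estimation simply averages the neighbors of that class, the weighted possibility distance is replaced by a crude per-class score (the distance between the classifier output and the ideal output of that class), and the belief-functions machinery is reduced to returning a set of labels, i.e., a singleton class or, in case of near-ties, a meta-class. All names, the tie threshold meta_threshold, and the scoring rule are hypothetical and are not the paper's exact definitions.

```python
# Highly simplified sketch of the PBC decision flow (illustrative assumptions only).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def pbc_sketch(query, train_X, train_y, clf, k=9, meta_threshold=0.1):
    observed = ~np.isnan(query)
    dists = np.sqrt(((train_X[:, observed] - query[observed]) ** 2).sum(axis=1))
    nn_idx = np.argsort(dists)[:k]
    nn_labels = train_y[nn_idx]
    classes = np.unique(nn_labels)

    # Step 1: direct assignment when all K neighbors share one class (no imputation needed)
    if classes.size == 1:
        return {int(classes[0])}

    # Step 2: one estimation per class present in the KNNs, each classified separately
    scores = {}
    for label in classes:
        cls_neighbors = train_X[nn_idx][nn_labels == label]
        filled = query.copy()
        filled[~observed] = cls_neighbors.mean(axis=0)[~observed]
        proba = clf.predict_proba(filled.reshape(1, -1))[0]
        # Step 3: crude stand-in for the possibility distance of this class
        ideal = np.zeros_like(proba)
        ideal[list(clf.classes_).index(label)] = 1.0
        scores[int(label)] = np.linalg.norm(proba - ideal)

    # Step 4: smallest score wins; near-ties are kept together as a meta-class
    best = min(scores.values())
    return {lbl for lbl, s in scores.items() if s - best <= meta_threshold}

# Toy usage (hypothetical data): the first branch fires because all 3 neighbors share class 0
train_X = np.array([[1.0, 2.0], [1.2, 1.8], [1.1, 2.2], [5.0, 6.0], [5.2, 6.1], [4.9, 5.8]])
train_y = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y)
print(pbc_sketch(np.array([np.nan, 2.0]), train_X, train_y, clf, k=3))
```

In the actual method, the per-class score is the weighted possibility distance computed from the classification results of the KNNs, and the meta-class decision is made within belief functions theory rather than by a fixed threshold.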

The rest of this paper is organized as follows. Some methods used for comparison and basic knowledge about belief functions theory are introduced in Section 2, and the new PBC method is described in detail in Section 3. The proposed PBC method is then tested in Section 4 and compared with several other classical methods. Some related discussions are given in Section 5, followed by the conclusion.

Section snippets

Background knowledge

This section presents some related classical methods and basic knowledge.

Belief classification with multiple estimations

In this section, we propose a new incomplete pattern belief classification (PBC) method, which can avoid invalid imputation and reasonably characterize the uncertainty and imprecision caused by missing values. PBC first assesses the necessity of imputation for the query incomplete pattern by using its KNNs, i.e., PBC can directly assign the pattern to a specific class if its neighbors all come from the same class. Conversely, if multiple classes exist among the KNNs, PBC will provide

Experiment applications

Three experiments have been carried out to test and evaluate the performance of PBC with respect to MI [27], KNNI [6], FCMI [28], PCC [4], LLA [15] and GAIN [32]. K = 9 is the default in KNNI, LLA and PBC, and the other parameters of the different methods are kept at their default values. To verify that the implementation of PBC does not depend on the selection of the basic classifier, EK-NN, K-NN, SVM and ANN are employed as basic classifiers in the following experiments. Since we mainly focus on the classification of

Discussion

Here we discuss the computational complexity of PBC and its influencing factors. Let us assume that the incomplete patterns in the test set are classified using a training set with n patterns in the frame of discernment Ω = {ω1, ω2, …, ωc}. The computational complexity of PBC is dominated by the computation of distances between patterns to obtain the KNNs. In the preliminary classification of PBC, each test pattern needs to compute n distances to obtain its KNNs from the training set, so a

Conclusion

A new method, PBC, is proposed for classifying incomplete patterns. In PBC, the KNNs obtained from the known attributes of incomplete patterns are used for a preliminary classification, which effectively avoids the risk of erroneous imputation and the waste of resources. A cautious multiple imputation strategy is then applied according to the class information in the KNNs, which can well characterize the uncertainty of the incomplete pattern. Then, a weighted possibility distance method is proposed to finally determine the class of incomplete

CRediT authorship contribution statement

Zong-fang Ma: Software, Validation. Hong-peng Tian: Investigation, Formal analysis. Ze-chao Liu: Data curation. Zuo-wei Zhang: Conceptualization, Methodology, Writing - review & editing.

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2020.106175.

Acknowledgment

This work has been partially supported by Industrialization cultivation project of Shaanxi Provincial Department of Education (No. 18JC017).

References (41)

• Liu, Z.G., et al., Credal c-means clustering method based on belief functions, Knowl.-Based Syst. (2015)

• Wang, J.Q., et al., Intuitionistic fuzzy multi-criteria decision-making method based on evidential reasoning, Appl. Soft Comput. (2013)

• Ming, L.K., et al., Autonomous and deterministic supervised fuzzy clustering with data imputation capabilities, Appl. Soft Comput. (2011)

• Pelckmans, K., et al., Handling missing values in support vector machine classifiers, Neural Netw. (2005)

• Calvo-Zaragoza, J., et al., Improving kNN multi-label classification in prototype selection scenarios using class proposals, Pattern Recognit. (2015)

• Liu, Z.G., et al., A new belief-based K-nearest neighbor classification method, Pattern Recognit. (2013)

• Frank, A., et al., UCI Machine Learning Repository (2010)

• Gao, H., et al., A subspace ensemble framework for classification with high dimensional missing data, Multidimens. Syst. Signal Process. (2017)

• Liu, Z.G., et al., A new incomplete pattern classification method based on evidential reasoning, IEEE Trans. Cybern. (2015)

• Acuna, E., et al., The treatment of missing values and its effect on classifier accuracy, Classification, Clustering & Data Mining Applications (2004)