A novel purity-based k nearest neighbors imputation method and its application in financial distress prediction

doi:10.1016/j.engappai.2019.03.003

Engineering Applications of Artificial Intelligence

Volume 81, May 2019, Pages 283-299

https://doi.org/10.1016/j.engappai.2019.03.003

Abstract

Financial distress research often faces missing-value problems, and the choice of missing-value handling technique affects the classification results. More generally, missing-value handling is an important issue in the data sciences, because the handling approach constrains both the applicability and the performance of the classifier. Previous studies of missing values usually focused on classification accuracy and paid less attention to overall performance across different missing degrees. To obtain better accuracy while maintaining the integrity of the data, this study proposes a purity-based k nearest neighbor algorithm to improve the performance of missing value imputation. To verify the method, this study ran experiments with different missing degrees and different noise rates, which demonstrate that the proposed method performs better because it is less affected by noise. This paper also ran MAR, MCAR, and MNAR experiments and compared the proposed method with the listed imputation techniques. In addition, this study collected Taiwan Economic Journal (TEJ) datasets as a practical case of MNAR missing values and employed the proposed purity-based k nearest neighbor algorithm to build a financial distress prediction model. Finally, this study compared the proposed imputation algorithm with common imputation methods under different classifiers; the results show that the proposed imputation algorithm obtains better accuracy and is more stable across different missing degrees and noise levels.

Introduction

Financial distress (or financial crisis) is a business situation in which cash flow is not sufficient to pay debts. Financial crisis prediction is an important and challenging research topic. Since 1966 (Beaver, 1966), many methods have been used to predict corporate bankruptcy and financial crisis, including artificial intelligence and statistical methods; many studies have shown that artificial intelligence performs better than traditional statistical methods (Jerez et al., 2010). A financial distress model predicts, from recent financial data, whether a company will fall into financial distress, and it can be built with mathematical, statistical or data mining techniques (Sun et al., 2014). To allow enterprises, financial institutions and investors to take preventive or remedial actions as early as possible before a financial crisis, it is necessary to build an early warning method.

In financial practice, however, the financial statement data of enterprises are often quite limited, which makes modeling challenging and the available data precious. In addition to being limited, the existing data are usually impaired by incomplete records, and the unavailability of these records amplifies the problem of scarce data. Moreover, many standard statistical procedures require complete data. To build a good financial distress early warning model for stakeholders, this study proposes a novel missing value imputation method with a theoretical basis.

Missing values are handled during the pre-processing stage of data mining, which is a necessary procedure for obtaining a better outcome. If missing values are not handled carefully in pre-processing, the outcomes of the analysis may distort the facts. Many researchers have therefore proposed various missing value handling techniques. However, most of this research focused on classification accuracy and ignored the effect of different missing degrees in the data, which could result in a questionable outcome. The predicted values are also susceptible to outliers (or noise). Many missing values handling techniques either removed outliers from the dataset before performing imputation or retained the outliers when performing prediction. The former violates the spirit of information science, and the latter produces unrealistic output.

To address missing values in financial distress data, many studies tend to use traditional statistical methods, such as the listwise approach (Allison, 2002), hot deck imputation (Andridge and Little, 2010) and cold deck imputation (Shao, 2000). Furthermore, financial distress prediction focuses on two class labels (health or distress). Given the previously mentioned problems of outliers, artificial intelligence is better than traditional statistical methods. To obtain better results, many researchers proposed multiple imputation techniques, in which distinct estimation techniques impute the missing values. However, the removal of instances remains a concern, and these studies do not discuss different degrees of missing values. Therefore, this paper proposes a new imputation algorithm that handles different missing degrees while retaining all instances. This paper makes four contributions:

(1) Propose a new imputation algorithm based on purity k nearest neighbors imputation (PkNNI) for missing values imputation, and demonstrate the effects of noise on the proposed imputation method.
(2) Compare the performance of different missing degrees for the parameter combination of the proposed imputation method.
(3) Compare the performance of different imputation methods with the proposed imputation method.
(4) Build a financial distress prediction model based on the proposed imputation techniques.

The rest of this paper is organized as follows: Section 2 describes the related work, including financial distress, types of missing values, and imputation techniques. Section 3 introduces the concept and procedure of the proposed method. Section 4 presents the experimental framework, environment, dataset descriptions, and experimental results. Section 5 gives the conclusion.

Section snippets

Related work

This section introduces the related literature and the concepts of missing values and the k nearest neighbor technique.

Proposed method

Because missing values in the financial distress field are often handled by deletion, key information may be removed from the datasets. Many studies use traditional statistical imputation techniques, such as the listwise approach (Allison, 2002), hot deck imputation (Andridge and Little, 2010) and cold deck imputation (Shao, 2000). Additionally, many missing values handling techniques remove outliers from the collected dataset, and artificial intelligence

Experiments and results

To verify the effectiveness of the proposed method, this study implemented the noise experiment to demonstrate that the proposed method performs better because it is less affected by noise. The experiments followed the procedure proposed in Section 3 to demonstrate the effects of the proposed method on the UCI datasets in the MAR and MCAR experiments, and on the practically collected financial dataset in the MNAR experiments. Eight different types of datasets were chosen from the UCI
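As a rough illustration of the MCAR and MAR settings mentioned above, the sketch below injects missing cells into a complete dataset at a chosen missing degree. The function names `mask_mcar` and `mask_mar`, the median-split rule used for MAR, and the use of `None` to mark a missing cell are all assumptions made for this example; the source does not describe how the authors generated missingness.

```python
import random


def mask_mcar(rows, rate, seed=0):
    """MCAR: every cell is hidden with the same probability `rate`,
    independently of any value in the data."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mask_mar(rows, target, driver, rate, seed=0):
    """MAR (illustrative rule): column `target` is hidden with probability
    `rate`, but only in rows whose observed `driver` value lies above the
    driver median. Missingness depends on an observed column, never on the
    hidden value itself."""
    rng = random.Random(seed)
    values = sorted(row[driver] for row in rows)
    median = values[len(values) // 2]
    masked = []
    for row in rows:
        row = list(row)  # copy so the input rows stay untouched
        if row[driver] > median and rng.random() < rate:
            row[target] = None
        masked.append(row)
    return masked
```

Calling `mask_mcar(rows, 0.1)`, `mask_mcar(rows, 0.2)`, and so on gives one way to produce a series of datasets with increasing missing degree from the same complete data, so that imputation methods can be compared at each level.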

Conclusions

To treat the missing values problem, this study proposed a new imputation method for handling missing values, which can filter outliers and noise through a purity computation. This study implemented three types of experiments, including the use of UCI datasets in MAR and MCAR experiments and the use of TEJ datasets in MNAR experiments. The experimental results show that the proposed method can perform better in both the UCI datasets and TEJ datasets. Because the proposed imputation method
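The idea of filtering outliers and noise through a purity computation before a k nearest neighbor imputation can be sketched as follows. This is not the authors' PkNNI algorithm: the text here does not give the purity formula, so the sketch assumes that the purity of a complete record is the share of its k nearest neighbours carrying the same class label, and that records below a threshold `min_purity` are excluded from the donor pool. The names `purity_scores`, `impute`, and `min_purity` are hypothetical.

```python
import math


def dist(a, b, cols):
    """Euclidean distance between feature vectors a and b over `cols`."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in cols))


def purity_scores(complete, k):
    """Assumed purity: fraction of a record's k nearest neighbours that
    share its label. `complete` is a list of (features, label) pairs with
    no missing values and at least two records."""
    scores = []
    for i, (x, y) in enumerate(complete):
        cols = range(len(x))
        others = [complete[j] for j in range(len(complete)) if j != i]
        others.sort(key=lambda r: dist(x, r[0], cols))
        neigh = others[:k]
        scores.append(sum(1 for _, ny in neigh if ny == y) / len(neigh))
    return scores


def impute(x, complete, scores, k, min_purity=0.5):
    """Fill the None entries of x with the mean of the k nearest donors,
    measured on the observed columns of x only."""
    observed = [c for c, v in enumerate(x) if v is not None]
    # Drop low-purity records (likely noise or outliers) from the donor
    # pool; fall back to all records if the filter removes everything.
    pool = [r for r, s in zip(complete, scores) if s >= min_purity] or complete
    donors = sorted(pool, key=lambda r: dist(x, r[0], observed))[:k]
    return [v if v is not None else sum(d[0][c] for d in donors) / len(donors)
            for c, v in enumerate(x)]
```

Setting `min_purity=0.0` disables the filter and reduces the sketch to a plain kNN mean imputation, which makes it easy to compare the two behaviours on data that contains a mislabeled or outlying record.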

Acknowledgment

We would like to thank the Ministry of Science and Technology of Taiwan. This research was partially supported by a research project of the Taiwan Ministry of Science and Technology (MOST 107-2221-E-224-036).

References (37)

Amiri M. et al., Missing data imputation using fuzzy-rough methods, Neurocomputing (2016)
Datta S. et al., A feature weighted penalty based dissimilarity measure for k-nearest neighbor classification with missing features, Pattern Recognit. Lett. (2016)
Ding Y. et al., Forecasting financial condition of Chinese listed companies based on support vector machine, Expert Syst. Appl. (2008)
Donders A.R.T. et al., Review: a gentle introduction to imputation of missing values, J. Clin. Epidemiol. (2006)
García-Laencina P.J. et al., K nearest neighbours with mutual information for simultaneous classification and missing data imputation, Neurocomputing (2009)
Garciarena U. et al., An extensive analysis of the interaction between missing data types, imputation methods, and supervised classifiers, Expert Syst. Appl. (2017)
Jerez J.M. et al., Missing data imputation using statistical and machine learning methods in a real breast cancer problem, Artif. Intell. Med. (2010)
Li H. et al., Ranking-order case-based reasoning for financial distress prediction, Knowl.-Based Syst. (2008)
Li H. et al., Financial distress prediction based on OR-CBR in the principle of k-nearest neighbors, Expert Syst. Appl. (2009)
Lin F. et al., Financial ratio selection for business crisis prediction, Expert Syst. Appl. (2011)
Liu Z. et al., Adaptive imputation of missing values for incomplete pattern classification, Pattern Recognit. (2016)
Sun J. et al., Predicting financial distress and corporate failure: A review from the state-of-the-art definitions, modeling, sampling, and featuring approaches, Knowl.-Based Syst. (2014)
Tsai C.F. et al., Combining instance selection for better missing value imputation, J. Syst. Softw. (2016)
Xia J. et al., Adjusted weight voting algorithm for random forests in handling missing values, Pattern Recognit. (2017)
Zhou L.G. et al., The performance of corporate financial distress prediction models with features selection guided by domain knowledge and data mining approaches, Knowl.-Based Syst. (2015)
Abdiansah A. et al., Time complexity analysis of support vector machines (SVM) in LibSVM, Int. J. Comput. Appl. (2015)
Acuña E. et al., The treatment of missing values and its effect on classifier accuracy
Allison P.D., Missing data: Quantitative applications in the social sciences, Br. J. Math. Stat. Psychol. (2002)

Cited by (51)

Efficient imputation of missing data using the information of local space defined by the geometric one-class classifier
2024, Expert Systems with Applications
Datasets gathered from actual systems may include missing data owing to unintentional faults, such as the breakdown of equipment as well as intentional reasons such as sampling inspection. Because missing data can result in incorrect and distorted results when analyzed, they should be addressed before the analysis is performed. Imputation of missing data involves replacing missing entries of data with values calculated from observed features, which is a more reasonable alternative than simple methods, including a complete case analysis. Although various imputation methods exist for missing data, most ignore the local space around it, which may be closely related to missing values. Furthermore, the imputation method, which can partially reflect local relationships, is susceptible to overfitting and has parameter tuning issues owing to the lack of a systematic definition of the local space. Thus, we propose a composite fuzzy hyper-rectangle (H-RTGL) imputation (CFHRI) method with the following characteristics: (i) it defines the local space using an H-RTGL-based one-class classifier to thoroughly describe the data of the target class, and (ii) it imputes the missing entries using a fuzzy model comprising imputation models calculated from H-RTGLs. These features enable CFHRI to formulate the local space adjacent to missing data systematically and alleviate the hazards of overfitting into a certain region of the dataset. We validated our method based on numerical experiments conducted using a dataset gathered from an actual system and comparison of the imputation performance of our method with that of other imputation methods. CFHRI showed statistically significant improvement in 5 datasets among 7 datasets used, and around 10% enhanced in terms of Mean Absolute Error (MAE). Moreover, we could achieve 3–5% of increased classification accuracy of imputed dataset, which indicates CFHRI can be a useful pre-processor of dataset whose purpose is classification.
Summarising multiple clustering-centric estimates with OWA operators for improved KNN imputation on microarray data
2023, Fuzzy Sets and Systems
As part of celebrating the success of OWA operators and their contributions over the past decades, this work presents an original investigation of exploiting OWA in dealing with missing value imputation witnessed in microarray experimental data. This task is significant in life science and its realisation to humanity. Both argument-independent and -dependent variants of such operators are applied to summarise a collection of estimates, determined through the concept of clustering-centric KNN imputation. This provides an innovative alternative to the state-of-the-art model that makes use of a single clustering to identify neighbours of a particular instance of interest. Instead of manually specifying a data partition, the proposed approach works by selecting a subset of diverse clusterings or committees from a candidate pool, which has been prepared using k-means and different (and popular) generation strategies invented for ensemble clustering. It is automated through a greedy forward-search looking for a desired number of committee members. Based on published gene expression datasets and different experimental settings, the resulting model generally outperforms its baselines, being competitive to related methods found in the literature. Further extensions to iterative refinement and supervised imputation are also discussed in addition to the analysis of algorithmic parameters.
A generic sparse regression imputation method for time series and tabular data
2023, Knowledge-Based Systems
Although many missing data imputation methods have been proposed in the relevant literature, they focus on either time series or tabular data, but not on both. Hence, a generic sparse regression method for missing data imputation is proposed. The imputed values of a target feature are generated by solving a sparse least squares problem using a preconditioned iterative method based on generic approximate sparse pseudoinverse. Sparsity is introduced by dummy encoding existing or constructed (through discretization) categorical features. Extensive experiments were conducted on several datasets, and the results demonstrate the effectiveness of the method for both time series and tabular data.
A case-based reasoning driven ensemble learning paradigm for financial distress prediction with missing data
2023, Applied Soft Computing
Financial distress prediction is often accompanied by missing sample data. For this purpose, a novel case-based reasoning (CBR) driven ensemble learning paradigm is proposed for financial distress prediction with missing data. In the proposed paradigm, three main stages, CBR-driven missing data imputation, CBR-driven single classifiers prediction, and CBR-driven ensemble result output, are involved. In the first stage, the CBR-driven missing data imputation method is used to fill in missing values in the initial dataset. Second, three different CBR-driven single classification models are constructed using Manhattan distance, Euclidean distance, and cosine distance to predict financial distress, respectively. In the final stage, the weighted majority voting strategy is used to ensemble prediction results of the CBR-driven single classification models to improve prediction accuracy and robustness. For illustration and verification, the experiments on datasets with different missing rates of six Chinese listed companies are performed. And corresponding results show that the proposed CBR-driven ensemble learning paradigm can effectively improve the imputation performance and increase the robustness of classification performance, indicating that the proposed CBR-driven ensemble learning paradigm can be used as a competitive solution to financial distress prediction with missing data.
The impact of heterogeneous distance functions on missing data imputation and classification performance
2022, Engineering Applications of Artificial Intelligence
This work performs an in-depth study of the impact of distance functions on K-Nearest Neighbours imputation of heterogeneous datasets. Missing data is generated at several percentages, on a large benchmark of 150 datasets (50 continuous, 50 categorical and 50 heterogeneous datasets) and data imputation is performed using different distance functions (HEOM, HEOM-R, HVDM, HVDM-R, HVDM-S, MDE and SIMDIST) and $k$ values (1, 3, 5 and 7). The impact of distance functions on kNN imputation is then evaluated in terms of classification performance, through the analysis of a classifier learned from the imputed data, and in terms of imputation quality, where the quality of the reconstruction of the original values is assessed. By analysing the properties of heterogeneous distance functions over continuous and categorical datasets individually, we then study their behaviour over heterogeneous data. We discuss whether datasets with different natures may benefit from different distance functions and to what extent the component of a distance function that deals with missing values influences such choice. Our experiments show that missing data has a significant impact on distance computation and the obtained results provide guidelines on how to choose appropriate distance functions depending on data characteristics (continuous, categorical or heterogeneous datasets) and the objective of the study (classification or imputation tasks).
Handling missing data through deep convolutional neural network
2022, Information Sciences
Citation Excerpt :
K-nearest neighbor imputation (KNNI) is considered one of the most popular techniques due to its simplicity and effectiveness compared to other approaches. In [13], purity k nearest neighbors imputation (PkNNI) was proposed as an extension of the traditional KNNI method, which is based on purity training and imputation. In this method, the purity of a record is computed by aggregating the votes of records that are selected as their nearest neighbours.
The presence of missing data is a challenging issue in processing real-world datasets. It is necessary to improve the data quality by imputing the missing values so that effective learning from data can be achieved. Recently, deep learning has become the most powerful type of machine learning techniques, which can be used for discovering the hidden knowledge that exists in a large dataset to make accurate predictions. In this paper, we propose an imputation method that involves using a convolutional neural network to impute the missing values. The missing value of each instance is imputed essentially by using a trained kernel. The weights of the kernel are determined by learning from the given data that are arranged spatially in the data matrix. The kernel carries out a weighted sum of neighboring elements in an array for imputing the missing values. In addition, in the absence of the true values with which the missing values are expected to be replaced, a loss function is designed without the need to know the true value. Our method is evaluated on UCI datasets in comparison with state-of-the-art methods. The experimental results show that the proposed approach performs closely to or better than other methods.


^☆: No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.03.003.
