A novel purity-based k nearest neighbors imputation method and its application in financial distress prediction

https://doi.org/10.1016/j.engappai.2019.03.003

Abstract

Financial distress research often suffers from missing values, and the choice of missing value handling technique affects the classification results. More generally, missing value handling is an important issue in data science, because the handling approach constrains both the applicability and the performance of the classifier. Previous studies of missing values have usually focused on classification accuracy and have paid less attention to overall performance across different missing degrees. To obtain better accuracy while maintaining the integrity of the data, this study proposes a purity-based k nearest neighbor algorithm to improve the performance of missing value imputation. To verify the method, this study ran experiments with different missing degrees and different noise rates, which show that the proposed method performs better because it is less affected by noise. This paper also ran missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR) type experiments, and compared the proposed method with the listed imputation techniques. In addition, this study collected Taiwan Economic Journal (TEJ) datasets as a practical case of MNAR type missing values, and then employed the proposed purity-based k nearest neighbor algorithm to build a financial distress prediction model. Finally, this study compared the proposed imputation algorithm with common imputation methods under different classifiers; the results show that the proposed imputation algorithm obtains better accuracy and is more stable across different missing degrees and noise levels.

Introduction

Financial distress (or financial crisis) is a business situation in which cash flow is not sufficient to pay a debt. Financial crisis prediction is an important and challenging research topic. Since 1966 (Beaver, 1966), many methods have been used to predict corporate bankruptcy and financial crisis, including artificial intelligence and statistical methods; many studies have shown that artificial intelligence is better than traditional statistical methods (Jerez et al., 2010). A financial distress model predicts whether a company will fall into financial distress based on its recent financial data, using mathematical, statistical or data mining techniques (Sun et al., 2014). To allow enterprises, financial institutions and investors to take preventive or remedial actions as early as possible before a financial crisis, it is necessary to build a method that warns of financial crisis.

In financial practice, however, the financial statement data of enterprises are often quite limited, which makes modeling challenging and the available data precious. In addition to being limited, the existing data are usually impaired by incomplete records. The unavailability of these records therefore amplifies the problem of scarce data. Moreover, many standard statistical procedures require complete data. To build a good financial distress early warning model, this study proposes a novel missing value imputation method with a theoretical basis for stakeholders.

Missing values are handled during the pre-processing stage of data mining, which is a necessary step for obtaining a better outcome. If missing values are not handled carefully during pre-processing, the outcomes of the analysis may distort the facts. Missing value handling is therefore an important pre-processing procedure, and many researchers have proposed various types of missing value handling techniques. However, most of this research focused on classification accuracy and ignored the effect of different missing degrees of data, which could result in a questionable outcome. In addition, the predicted values are susceptible to outliers (or noise). Many missing value handling techniques either remove outliers from the dataset before performing imputation or retain the outliers when performing prediction. The former violates the spirit of information science, and the latter yields unrealistic output.

To address missing values in financial distress data, many studies tend to use traditional statistical methods, such as the listwise approach (Allison, 2002), hot deck imputation (Andridge and Little, 2010) and cold deck imputation (Shao, 2000). Furthermore, financial distress prediction focuses on two class labels (healthy or distressed). Given the previously mentioned problems with outliers, artificial intelligence is better than traditional statistical methods. To obtain better results, many researchers have proposed multiple imputation techniques, in which distinct estimation techniques impute the missing values. However, the removal of instances remains a concern, and these studies do not discuss different degrees of missing values. Therefore, this paper proposes a new imputation algorithm that handles different missing degrees while retaining all instances. This paper makes four contributions:

  • (1) Propose a new imputation algorithm based on purity k nearest neighbors imputation (PkNNI) for missing values imputation, and demonstrate the effects of noise on the proposed imputation method.

  • (2) Compare the performance of different missing degrees for the parameter combination of the proposed imputation method.

  • (3) Compare the performance of different imputation methods with the proposed imputation method.

  • (4) Build a financial distress prediction model based on the proposed imputation techniques.
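For orientation, the conventional k nearest neighbors imputation that PkNNI extends can be sketched as follows. This is a minimal illustration of the baseline technique only, not the proposed PkNNI: the purity step is not specified in this excerpt and is therefore omitted. The function names `partial_distance` and `knn_impute`, the rescaled partial Euclidean distance, and the use of the donor mean are all assumptions made for the example.

```python
import math


def partial_distance(a, b):
    """Euclidean distance over features observed in both records.

    The squared sum is rescaled by (total features / shared features) so that
    records compared on few features are not favoured (an assumption of this sketch).
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # nothing to compare on
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=3):
    """Return a copy of rows where each None is replaced by the mean of that
    feature over the k nearest records that observe it (baseline kNN imputation)."""
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors: other records that observe feature j and are comparable to row i.
            donors = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = partial_distance(row, other)
                if d != math.inf:
                    donors.append((d, other[j]))
            donors.sort(key=lambda pair: pair[0])
            nearest = donors[:k]
            if nearest:  # the cell stays None when no donor exists
                out[i][j] = sum(v for _, v in nearest) / len(nearest)
    return out
```

Because every instance is kept and only the missing cells are filled, a sketch of this kind matches the stated goal of retaining all instances; how sensitive it is to noisy donors is the issue the purity computation is meant to address.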

The rest of this paper is organized as follows: Section 2 describes the related work, including financial distress, types of missing values, and imputation techniques. Section 3 introduces the concept and procedure of the proposed method. Section 4 presents the experimental framework, environment, dataset descriptions, and experimental results. Section 5 concludes the paper.

Section snippets

Related work

This section introduces the related literature and the concepts of missing values and the k nearest neighbor technique.
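The three types of missing values used in the experiments (MCAR, MAR, and MNAR) differ in what the probability of a value being missing depends on. The sketch below is a simplified illustration under stated assumptions, not the procedure used in the paper: the function name `make_missing`, its parameters, and the hard "largest values go missing" rule for MAR and MNAR are hypothetical choices made to keep the example short.

```python
import random


def make_missing(rows, col, rate, mechanism="MCAR", driver_col=0, seed=0):
    """Return a copy of rows with about `rate` of column `col` set to None.

    MCAR: every record has the same chance of losing the value.
    MAR:  the value is dropped where another, fully observed column
          (driver_col) is largest, so missingness depends on observed data.
    MNAR: the value is dropped where the value itself is largest, so
          missingness depends on the unobserved data.
    """
    n = len(rows)
    n_missing = round(rate * n)  # the "missing degree" as a count
    if mechanism == "MCAR":
        drop = set(random.Random(seed).sample(range(n), n_missing))
    elif mechanism == "MAR":
        order = sorted(range(n), key=lambda i: rows[i][driver_col], reverse=True)
        drop = set(order[:n_missing])
    elif mechanism == "MNAR":
        order = sorted(range(n), key=lambda i: rows[i][col], reverse=True)
        drop = set(order[:n_missing])
    else:
        raise ValueError("mechanism must be 'MCAR', 'MAR' or 'MNAR'")
    out = [list(r) for r in rows]  # leave the input untouched
    for i in drop:
        out[i][col] = None
    return out
```

Varying `rate` gives datasets with different missing degrees from the same complete data, which is the kind of comparison the experiments describe.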

Proposed method

Because missing values in the financial distress field are often handled by deletion, key information may be removed from the datasets. Many studies use traditional statistical methods as imputation techniques, such as the listwise approach (Allison, 2002), hot deck imputation (Andridge and Little, 2010) and cold deck imputation (Shao, 2000). Additionally, many missing values handling techniques remove outliers from the collected dataset, and artificial intelligence

Experiments and results

To verify the effectiveness of the proposed method, this study ran a noise experiment to demonstrate that its better performance arises because the proposed method is less affected by noise. The experiments followed the procedure proposed in Section 3 to demonstrate the effects of the proposed method on the UCI datasets in the MAR and MCAR experiments, and on the practically collected financial dataset in the MNAR experiments. Eight different types of datasets were chosen from the UCI

Conclusions

To treat the missing values problem, this study proposed a new imputation method for handling missing values, which can filter outliers and noise through a purity computation. This study implemented three types of experiments, including the use of UCI datasets in MAR and MCAR experiments and the use of TEJ datasets in MNAR experiments. The experimental results show that the proposed method can perform better in both the UCI datasets and TEJ datasets. Because the proposed imputation method

Acknowledgment

We would like to thank the Ministry of Science and Technology of Taiwan. This research was partially supported by a research project of the Ministry of Science and Technology of Taiwan (MOST 107-2221-E-224-036).

References (37)

  • Liu, Z. et al. Adaptive imputation of missing values for incomplete pattern classification. Pattern Recognit. (2016)

  • Sun, J. et al. Predicting financial distress and corporate failure: A review from the state-of-the-art definitions, modeling, sampling, and featuring approaches. Knowl.-Based Syst. (2014)

  • Tsai, C.F. et al. Combining instance selection for better missing value imputation. J. Syst. Softw. (2016)

  • Xia, J. et al. Adjusted weight voting algorithm for random forests in handling missing values. Pattern Recognit. (2017)

  • Zhou, L.G. et al. The performance of corporate financial distress prediction models with features selection guided by domain knowledge and data mining approaches. Knowl.-Based Syst. (2015)

  • Abdiansah, A. et al. Time complexity analysis of support vector machines (SVM) in LibSVM. Int. J. Comput. Appl. (2015)

  • Acuña, E. et al. The treatment of missing values and its effect on classifier accuracy.

  • Allison, P.D. Missing data: Quantitative applications in the social sciences. Br. J. Math. Stat. Psychol. (2002)

Cited by (51)

  • Handling missing data through deep convolutional neural network. Information Sciences (2022)

    Citation excerpt: K-nearest neighbor imputation (KNNI) is considered one of the most popular techniques due to its simplicity and effectiveness compared to other approaches. In [13], purity k nearest neighbors imputation (PkNNI) was proposed as an extension of the traditional KNNI method, which is based on purity training and imputation. In this method, the purity of a record is computed by aggregating the votes of records that are selected as their nearest neighbours.


No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.03.003.
