Elsevier

Neurocomputing

Volume 211, 26 October 2016, Pages 172-181
Probability model selection and parameter evolutionary estimation for clustering imbalanced data without sampling

https://doi.org/10.1016/j.neucom.2015.10.140

Abstract

Data imbalance problems arising from the growing accumulation of data, especially big data, have become a challenging issue in recent years. In imbalanced data, the minority subsets often contain important patterns. Although several approaches exist for discovering class patterns, few of them have been applied to clustering minority patterns. Typically, minority samples are submerged in big data and, without the supervision of a training set, are often ignored or misclassified into majority patterns. Since clustering minorities is an uncertain process, in this paper we employ model selection and evolutionary computation to address the uncertainty and concealment of minority data in imbalanced data clustering. Given a data set, model selection chooses a model from a set of candidate models. We take probability models as candidates because they handle uncertainty effectively and are therefore well suited to data imbalance. Considering the difficulty of estimating the models' parameters, we employ an evolutionary process to adjust and estimate the optimal parameters. Experimental results show that our proposed approach to clustering imbalanced data can search for and discover minority patterns, and achieves better performance than many other relevant clustering algorithms on several performance indices.

Introduction

The class imbalance problem is a challenging issue in the data mining community because most data mining algorithms pay more attention to learning from the majority samples. Minority samples are often ignored or misclassified into majority sample sets, which leads to high error rates or low precision in the discovered patterns. In a class imbalance problem, one or several classes contain far fewer data instances than the others, yet these few instances may play a key role in some data mining tasks. For example, a great many texts are published on the Internet every day, and we may want to analyze these data to discover the relations among hot topics or events. Such topics or events are a minority among all data on the Internet, but they are important. Similar situations in real-world applications include medical diagnosis, fraud detection, telecommunication analysis, Web mining, etc.

Most recent efforts focus on two-class imbalanced datasets, in which one class is the minority (also called the positive class) and the other is the majority (also called the negative class). However, multi-class imbalance problems also occur frequently. For example, some local events on the Web emerge during different periods. These local data instances are a minority compared with other global event data, but for certain goals they are just as important as the global majority. Therefore, in this paper we consider not only the two-class imbalanced data clustering problem but also attempt to solve the multi-class problem.

Two kinds of methods are available for learning from imbalanced data. One is the preprocessing approach, such as sampling and feature selection; the other is the algorithmic approach, such as ensembles of classifiers and modifications of existing algorithms, which handles imbalanced data by designing algorithms that construct learners directly. To date, most of these approaches are suited to the binary-class imbalance problem. In recent years, multi-class imbalanced data has attracted increasing attention. However, most existing approaches apply to the supervised classification setting; only a few studies focus on the imbalanced data clustering problem.

In this paper, we study an imbalanced data clustering algorithm based on the selection of probability models and evolutionary estimation of their parameters. The reasons for applying a probabilistic approach to the class imbalance problem are analyzed in the next section, which also recalls related work and its characteristics. Section 3 contains preliminaries and background, including model selection and evolutionary parameter estimation. In Section 4, we detail the proposed approach based on probability model selection without over- or under-sampling. Section 5 introduces the experimental setup, including the data sets, the selected probability models with their corresponding parameters, and the statistical testing, and then presents the experimental results and analysis against the most significant algorithms for imbalanced data. Finally, in Section 6, we present our concluding remarks.

Section snippets

Related work

In this section, we first introduce related work about the class imbalance problem in classification and clustering. Then, we introduce the hot topics on imbalanced data clustering.

Preliminaries and proposed concepts

In this section, we will briefly introduce several basic concepts employed in our proposed approach such as model selection and probability-based model selection, evolutionary computation and evolutionary computation-based parameter estimation. We also specify the advantages of applying probability model selection approach to class imbalance problem.

Proposed approach

In this section, we will introduce our proposed approach, beginning with its formal description.

Let f(x, Θ) be a probability model and Θ be the parameter set. According to Definition 1 we can obtain:

f(x | Θ) ← f(x | argmax_i p_i(x | θ_1, θ_2, …, θ_t)),  i = 1, 2, …, m,

where m is the number of candidate models in Θ.
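The selection step above picks, among the candidate probability models, the one that best explains the data. A minimal sketch of this idea, using maximum likelihood over a hypothetical candidate set of SciPy distributions (the candidate names and the `select_model` helper are illustrative, not the authors' implementation):

```python
import numpy as np
from scipy import stats

def select_model(x, candidates):
    """Return the candidate p_i maximizing the log-likelihood of x,
    i.e. a concrete stand-in for argmax_i p_i(x | theta_1..theta_t)."""
    best_name, best_ll = None, -np.inf
    for name, dist in candidates.items():
        params = dist.fit(x)                     # MLE parameter estimates
        ll = np.sum(dist.logpdf(x, *params))     # log-likelihood under this model
        if ll > best_ll:
            best_name, best_ll = name, ll
    return best_name, best_ll

# Illustrative data and candidate set (not from the paper).
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=500)
candidates = {"norm": stats.norm, "gamma": stats.gamma, "laplace": stats.laplace}
name, ll = select_model(x, candidates)
```

Here each candidate's parameters are fitted by maximum likelihood before comparing models; the paper instead estimates them evolutionarily, as formalized next.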

In order to get the optimal values of the parameters in Θ, we employ evolutionary computation to solve the problem:

Θ ← EO(Θ) = EO(θ_1, θ_2, …, θ_t) = g(θ_1, θ_2, …, θ_t | X) = g(θ_1, θ_2, …, θ_t | x_1, x_2, …, x_n),

where EO(·) represents the evolutionary optimization process.
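To make the EO(·) step concrete, the following is a minimal sketch of evolutionary parameter estimation, assuming a Gaussian model whose fitness g is the log-likelihood; the simple mutate-and-select loop, population size, and decay schedule are illustrative choices, not the paper's algorithm:

```python
import numpy as np

def log_likelihood(theta, x):
    """g(theta | x_1..x_n): Gaussian log-likelihood for theta = (mu, sigma)."""
    mu, sigma = theta
    if sigma <= 0:
        return -np.inf
    n = len(x)
    return (-0.5 * np.sum(((x - mu) / sigma) ** 2)
            - n * np.log(sigma) - 0.5 * n * np.log(2 * np.pi))

def evolve(x, pop_size=30, generations=100, step=0.5, seed=0):
    """A simple mutate-and-select loop standing in for EO(.):
    perturb the current best parameter vector, keep the fittest."""
    rng = np.random.default_rng(seed)
    best = np.array([np.mean(x), np.std(x) + 1.0])  # rough initial guess
    best_fit = log_likelihood(best, x)
    for _ in range(generations):
        offspring = best + rng.normal(scale=step, size=(pop_size, 2))
        fits = np.array([log_likelihood(c, x) for c in offspring])
        i = int(np.argmax(fits))
        if fits[i] > best_fit:
            best, best_fit = offspring[i], fits[i]
        step *= 0.97  # gradually shrink mutation strength
    return best, best_fit

# Illustrative data: the evolved theta should approach the sample MLE.
rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=400)
theta, fit = evolve(x)
```

Unlike closed-form MLE, this search-based estimation extends directly to models whose likelihoods are hard to maximize analytically, which motivates its use here.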

Experimental results

In this section we give experiments on 19 datasets using our proposed algorithm and other classical clustering algorithms. Section 5.1 explains the experimental settings including experimental datasets, compared algorithms, and evaluation metrics. Section 5.2 presents the sensitivity analysis of the parameters used in our algorithm. Section 5.3 presents the detailed experimental results.

Conclusions

With the exponential growth of data, class imbalance problems have become prominent issues, posing many challenges that must be addressed seriously. Traditional theories, methods, and techniques are insufficient for handling the challenges hidden in imbalanced data, because those solutions require enough data samples; otherwise it is difficult to obtain good learners or strong performance. In imbalanced data, the positive data instances are far fewer than the negative ones. It is

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments and suggestions. The authors also wish to thank the students of our laboratory, Zheng Feng, Wenhua Liu, Xinghui Zhao and Xuan Li, for participating in the experiments over several months. This work was supported by the National Natural Science Foundation of China (Nos. 61203305, 61433012, and 61303167), and the Special Funds of Taishan Scholars Construction Project.


Jiancong Fan received his M.Sc. and Ph.D. degrees in computer science from the College of Information Science and Engineering, Shandong University of Science and Technology, China, in 2003 and 2010, respectively. He joined the Department of Computer Science and Technology of Shandong University of Science and Technology in 2003 as a teaching assistant. His current research interests include machine learning and data mining. He has worked on learning from labeled and unlabeled data, evolutionary learning, and industrial data analysis. To date, he has published over 30 papers in national and international journals and conferences. Currently, he serves as a reviewer for Information Sciences, Evolutionary Computation, etc.

Zhonghan Niu is currently a master's candidate at the College of Information Science and Engineering, Shandong University of Science and Technology. He received the B.S. degree from the School of Information Science and Engineering, University of Jinan, China, in 2014. His current research interests are machine learning and data mining.

Yongquan Liang received his M.Sc. and Ph.D. degrees in computer science from Beihang University and the Institute of Computing Technology, Chinese Academy of Sciences, China, in 1992 and 1999, respectively. He was a visiting scholar at the Institute of AIFB, Karlsruhe University. He is now a professor at Shandong University of Science and Technology. His current research interests include data mining and intelligent information processing. To date, he has published over 100 papers in national and international journals and conferences.

Zhongying Zhao received her Ph.D. degree in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), in 2012. She worked at the Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, as an assistant professor from 2012 to 2014. She is currently an assistant professor in the College of Information Science and Engineering, Shandong University of Science and Technology. She has published 16 papers in international journals and conference proceedings. Her research interests focus on social network analysis and data mining.
