Probability model selection and parameter evolutionary estimation for clustering imbalanced data without sampling
Introduction
The class imbalance problem is a challenging issue in the data mining community because most data mining algorithms pay more attention to learning from the majority samples. Minority samples are often ignored and misclassified into majority sample sets, which leads to high error rates or low precision in the discovered patterns. In a class imbalance problem, one or several classes contain far fewer data instances than the others, yet these few instances may play a key role in some data mining tasks. For example, a great many texts are published on the Internet every day, and we may want to analyze these data to discover relations among hot topics or events. Such topics or events are a minority among all the data on the Internet, but they are important. Similar situations in real-world applications include medical diagnosis, fraud detection, telecommunication analysis, Web mining, etc.
Most recent efforts focus on two-class imbalanced datasets, in which one class is the minority (also called the positive class) and the other is the majority (also called the negative class). However, multi-class imbalance problems also occur frequently. For example, some local events on the Web emerge during different periods. These local data instances are a minority compared with global event data, yet they are just as important as the global majority when different goals are considered. Therefore, in this paper we consider not only the two-class imbalanced data clustering problem but also attempt to solve the multi-class problem.
There are two kinds of methods available for learning from imbalanced data. One is the preprocessing approach, such as sampling and feature selection; the other is the algorithmic approach, such as ensembles of classifiers and modifications of existing algorithms, which deals with imbalanced data by designing algorithms to construct learners. Up to now, most of these approaches have been suited to the binary-class imbalance problem. In recent years, multi-class imbalanced data has attracted increasing attention. However, most existing approaches apply to supervised classification; only a few studies focus on the imbalanced data clustering problem.
In this paper, we study an imbalanced data clustering algorithm based on the selection of probability models and evolutionary estimation of their parameters. The reason why we apply a probabilistic approach to the class imbalance problem is analyzed in the next section, which also recalls related work and its characteristics. Section 3 contains preliminaries and background, including model selection and parameter evolutionary estimation. In Section 4, we detail the proposed approach based on probability model selection without over- or under-sampling. Section 5 introduces the experimental setup, including the datasets, the selected probability models with their corresponding parameters, and the statistical testing, and then presents the experimental results and analysis against the most significant algorithms for imbalanced data. Finally, in Section 6, we present our concluding remarks.
Related work
In this section, we first review related work on the class imbalance problem in classification and clustering, and then introduce current hot topics in imbalanced data clustering.
Preliminaries and proposed concepts
In this section, we briefly introduce several basic concepts employed in our proposed approach, such as model selection and probability-based model selection, evolutionary computation, and evolutionary computation-based parameter estimation. We also explain the advantages of applying a probability model selection approach to the class imbalance problem.
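To make probability-based model selection concrete, the sketch below scores two candidate distributions on a data sample by maximum likelihood and picks the one with the lower Bayesian information criterion (BIC). This is a generic illustration of the idea, not the paper's exact selection criterion; the candidate distributions and the synthetic data are assumptions for the example.

```python
import math
import random

def bic(log_likelihood, k, n):
    # Bayesian information criterion: penalized fit, lower is better.
    return k * math.log(n) - 2.0 * log_likelihood

def gaussian_loglik(xs):
    # MLE fit of a Gaussian; returns (log-likelihood, number of parameters).
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1), 2

def exponential_loglik(xs):
    # MLE fit of an exponential distribution (rate = 1 / sample mean).
    n = len(xs)
    lam = n / sum(xs)
    return n * math.log(lam) - lam * sum(xs), 1

random.seed(0)
data = [random.gauss(5.0, 1.0) for _ in range(200)]  # synthetic sample

scores = {}
for name, fit in [("gaussian", gaussian_loglik), ("exponential", exponential_loglik)]:
    ll, k = fit(data)
    scores[name] = bic(ll, k, len(data))

best = min(scores, key=scores.get)
print(best)  # the Gaussian should win on Gaussian-generated data
```

The same pattern extends to any family of candidate models: fit each by maximum likelihood, then compare penalized scores so that extra parameters must earn their keep.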
Proposed approach
In this section, we introduce our proposed approach, beginning with its formal description.
Let $\mathcal{M}=\{M_1, M_2, \ldots, M_m\}$ be a set of candidate probability models and $\Theta=\{\theta_1, \theta_2, \ldots, \theta_m\}$ be the corresponding parameter sets. According to Definition 1 we can obtain
$$M^{*}=\arg\max_{M_i\in\mathcal{M}} P(X \mid M_i, \theta_i),$$
where $m$ is the number of models in $\mathcal{M}$.
In order to obtain the optimal values of the parameters in $\Theta$, we employ evolutionary computation to solve the problem
$$\theta_i^{*}=EO\big(P(X \mid M_i, \theta_i)\big),$$
where $EO(\cdot)$ represents the
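The evolutionary parameter estimation can be sketched as follows. This is a minimal stand-in for the evolutionary operator, not the authors' algorithm: a simple (mu + lambda)-style evolution strategy that maximizes the Gaussian log-likelihood over candidate (mean, standard deviation) pairs; population size, mutation scale, and the Gaussian model family are all illustrative assumptions.

```python
import math
import random

def log_likelihood(params, xs):
    # Gaussian log-likelihood of the data under candidate (mu, sigma).
    mu, sigma = params
    if sigma <= 0:
        return float("-inf")
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def evolve(xs, pop_size=30, generations=60, sigma_mut=0.3, seed=1):
    # Elitist evolution strategy: keep the best half of the population,
    # refill with Gaussian-mutated copies of the survivors.
    rng = random.Random(seed)
    pop = [(rng.uniform(-10, 10), rng.uniform(0.1, 5.0)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: log_likelihood(p, xs), reverse=True)
        parents = pop[: pop_size // 2]
        children = [(m + rng.gauss(0, sigma_mut),
                     max(1e-3, s + rng.gauss(0, sigma_mut)))
                    for m, s in parents]
        pop = parents + children
    return max(pop, key=lambda p: log_likelihood(p, xs))

rng = random.Random(0)
sample = [rng.gauss(3.0, 1.5) for _ in range(300)]
mu_hat, sigma_hat = evolve(sample)
print(round(mu_hat, 2), round(sigma_hat, 2))
```

Because the fitness function is just the likelihood, the same loop can estimate parameters for any candidate model family, which is what makes the evolutionary operator a natural fit for model selection without sampling.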
Experimental results
In this section, we report experiments on 19 datasets using our proposed algorithm and other classical clustering algorithms. Section 5.1 explains the experimental settings, including the datasets, the compared algorithms, and the evaluation metrics. Section 5.2 presents a sensitivity analysis of the parameters used in our algorithm. Section 5.3 presents the detailed experimental results.
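When evaluating clustering on imbalanced data, plain accuracy can look good even when every minority instance is absorbed into a majority cluster. The sketch below illustrates one commonly used imbalance-aware metric, the geometric mean of per-class recalls after mapping each cluster to its majority class; it is offered as a generic illustration, not as the specific metrics used in this paper's experiments, and the toy labels are made up.

```python
import math
from collections import Counter

def gmean_from_clusters(labels_true, labels_pred):
    # Map each cluster to its majority true class, then compute the
    # geometric mean of per-class recalls. Unlike accuracy, this drops
    # to zero if any class (e.g. the minority) is entirely missed.
    cluster_to_class = {}
    for c in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == c]
        cluster_to_class[c] = Counter(members).most_common(1)[0][0]
    mapped = [cluster_to_class[p] for p in labels_pred]
    recalls = []
    for cls in set(labels_true):
        idx = [i for i, t in enumerate(labels_true) if t == cls]
        recalls.append(sum(mapped[i] == cls for i in idx) / len(idx))
    return math.prod(recalls) ** (1.0 / len(recalls))

# Toy check: 9 majority points, 3 minority points, two clusterings.
true_labels = ["maj"] * 9 + ["min"] * 3
pred_perfect = [0] * 9 + [1] * 3
pred_lumped = [0] * 12  # minority absorbed into the majority cluster
print(gmean_from_clusters(true_labels, pred_perfect))  # 1.0
print(gmean_from_clusters(true_labels, pred_lumped))   # 0.0 -- minority recall is zero
```

Note that the lumped clustering still classifies 9 of 12 points "correctly" (75% accuracy), yet scores zero here, which is exactly the behavior one wants when minority classes matter.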
Conclusions
With the exponential growth of data, class imbalance problems have become prominent issues, and many challenges remain to be addressed seriously. Traditional theories, methods, and techniques are insufficient for handling the challenges hidden in imbalanced data, because those solutions require enough data samples; otherwise it is difficult to obtain good learners or good performance. In imbalanced data, the positive data instances are far fewer than the negative ones. It is
Acknowledgements
We would like to thank the anonymous reviewers for their valuable comments and suggestions. The authors also wish to thank the students of the laboratory Zheng Feng, Wenhua Liu, Xinghui Zhao and Xuan Li for participating in the experiment for several months. This work was supported by the National Natural Science Foundation of China (Nos. 61203305, 61433012, and 61303167), and the Special Funds of Taishan Scholars Construction Project.
Jiancong Fan received his M.Sc. and Ph.D. degrees in computer science from the College of Information Science and Engineering, Shandong University of Science and Technology, China, in 2003 and 2010, respectively. He joined the Department of Computer Science and Technology of Shandong University of Science and Technology in 2003 as a teaching assistant. His current research interests include machine learning and data mining. He has worked on learning from labeled and unlabeled data, evolutionary learning, and industrial data analysis. He has published over 30 papers in national and international journals and conferences. Currently, he serves as a reviewer for Information Sciences, Evolutionary Computation, etc.
Zhonghan Niu is currently a master's candidate at the College of Information Science and Engineering, Shandong University of Science and Technology. He received the B.S. degree from the School of Information Science and Engineering, University of Jinan, China, in 2014. His current research interests are machine learning and data mining.
Yongquan Liang received his M.Sc. and Ph.D. degrees in computer science from Beihang University and the Institute of Computing Technology, Chinese Academy of Sciences, China, in 1992 and 1999, respectively. He was a visiting scholar at the Institute of AIFB, Karlsruhe University. He is now a professor at Shandong University of Science and Technology. His current research interests include data mining and intelligent information processing. He has published over 100 papers in national and international journals and conferences.
Zhongying Zhao received her Ph.D. degree in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), in 2012. She worked at the Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, as an assistant professor from 2012 to 2014. She is currently an assistant professor at the College of Information Science and Engineering, Shandong University of Science and Technology. She has published 16 papers in international journals and conference proceedings. Her research interests focus on social network analysis and data mining.