Computing, Artificial Intelligence and Information Management
Framework for efficient feature selection in genetic algorithm based data mining

https://doi.org/10.1016/j.ejor.2006.02.040

Abstract

We present the design of more effective and efficient genetic algorithm based data mining techniques that use the concepts of feature selection. Explicit feature selection is traditionally done as a wrapper approach, in which every candidate feature subset is evaluated by executing the data mining algorithm on that subset. In this article we present a GA that performs the tasks of mining and feature selection simultaneously by evolving a binary code alongside the chromosome structure used for evolving the rules. We then present a wrapper approach to feature selection based on the Hausdorff distance measure. Results from applying the above techniques to a real world data mining problem show that combining both feature selection methods provides the best performance in terms of prediction accuracy and computational efficiency.

Introduction

The ubiquity of databases in almost every area of human endeavor has resulted in a rapid increase in the amount of data collected and stored by industry, government, and scientific organizations. Data in these databases vary in format and content, ranging from trillions of point-of-sale transactions and credit card purchases to pixel images of distant galaxies. It is not uncommon to come across databases measured in terabytes. For example, Wal-Mart, the chain of over 2000 retail stores, uploads over 20 million point-of-sale transactions every day to an AT&T massively parallel system with more than 1000 processors running a centralized database [4]. Wal-Mart operates a data warehouse with over 583 terabytes of sales and inventory data. Although Moore’s law was not meant to describe the growth of databases, the amount of data stored by businesses nearly doubles every 12–18 months. These massive collections of data have created opportunities to monitor, analyze, and predict processes of interest. In today’s fiercely competitive business environment, firms need to rapidly turn these terabytes of data into significant insights into their customers, markets, products, and processes to guide their marketing, investment, and management strategies and gain competitive advantage.

Although these data contain buried valuable information, raw data by themselves provide little insight. Raw data must first be processed to extract patterns or useful knowledge. To this end, the development of effective and efficient methods for deriving knowledge from these data is becoming increasingly important [16]. Recent efforts under the general rubric of data mining represent a strategic response to this growing need for analytic tools that deliver in situations where traditional analytical methods fail. This is especially evident when dealing with unstructured data that are noisy and incomplete, and where scalability of traditional methods is an issue. By discovering hidden, implicit, previously unknown, and potentially useful patterns and relationships in the data, data mining enables users to extract greater value from their data than simple query and analysis approaches.

Some commonly used data mining algorithms fall under the following categories: decision trees and rules [26], nonlinear regression and classification methods [14], example-based methods [18], probabilistic graphical dependency models [30], and relational learning models [12]. Over the years, genetic algorithms have been successfully applied to learning tasks in different domains, such as chemical process control [27], financial classification [28], and manufacturing scheduling [22], among others.

In this paper we propose frameworks for data mining using genetic algorithms, implement them, and evaluate their performance using examples. We chose the genetic algorithm for its simplicity and its capability as a powerful search mechanism. We present the design of more effective and efficient genetic algorithm based data mining techniques that use self-adaptive feature selection together with a wrapper feature selection method based on the Hausdorff distance measure. A genetic algorithm [17] uses a population of individual solution structures called chromosomes. The fitness of an individual solution is its performance measure. This measure is used to favor the selection of successful parents for new offspring, so that the whole population of solutions incrementally evolves towards greater fitness. Offspring solutions are produced from parent solutions by the application of crossover and mutation operators. Theory shows that knowledge about desirable solutions is advantageously stored in the population itself, implicitly contained in the surviving chromosomes. We take advantage of this principle in developing modified frameworks with the genetic algorithm at their core.
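The selection–crossover–mutation cycle described above can be sketched in a few lines of Python. This is a minimal illustration under assumed parameters (bit-string chromosomes, roulette-wheel selection, one-point crossover, bit-flip mutation), not the implementation used in this paper:

```python
import random

def evolve(fitness, n_bits=10, pop_size=20, generations=50,
           crossover_rate=0.9, mutation_rate=0.01, seed=0):
    """Minimal GA sketch: evolve bit-string chromosomes toward higher fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        total = sum(scores) or 1.0

        def pick():  # fitness-proportional (roulette-wheel) parent selection
            r, acc = rng.random() * total, 0.0
            for chrom, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return chrom
            return pop[-1]

        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < crossover_rate:      # one-point crossover
                cut = rng.randrange(1, n_bits)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # bit-flip mutation with a small per-bit probability
            child = [b ^ (rng.random() < mutation_rate) for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# "One-max" toy problem: fitness is simply the number of 1-bits.
best = evolve(sum)
```

On the one-max problem the population rapidly converges toward the all-ones chromosome, illustrating how fitness-proportional selection concentrates the population on high-fitness regions of the search space.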

The rest of this paper is organized as follows. We present a brief discussion of feature selection in the next section. In Section 3, we discuss the genetic algorithm and its basic dynamics. We also include the first framework for the basic genetic algorithm case and show how we implement and evaluate this methodology. This is followed by the presentation of a modified filter-based genetic algorithm with embedded feature selection in Section 4. In Section 5, we provide a brief discussion of the Hausdorff distance measure, and then present the proposed framework of a genetic algorithm with the Hausdorff distance as wrapper-based feature selection. We discuss the effectiveness of feature selection in Section 6, including experimental results using a real world data set. Section 7 concludes this paper.


Feature selection

A typical real world data set consists of as many features as are deemed necessary. This is constrained by (1) knowledge of the domain of interest, and in turn knowledge of the essential features that capture knowledge in this domain, (2) availability of these essential features, (3) resources available to gather these available essential features, and (4) resources available to store, maintain, and retrieve the collected features. Given these constraints, it is clear that not all features that end

Genetic algorithm based data mining

In this section we present the design of a genetic algorithm for rule learning in a data mining application. Assume that the data mining problem has k attributes and we have a set of training examples ψ = {(E_t, c) : t = 1, …, T}, where T is the total number of examples, each example E_t is a vector of k attribute values E_t = [e_t1, e_t2, …, e_tk], and c is its classification value (usually a binary value indicating whether the example is positive or negative). The goal of data mining is to learn concepts that can
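The setting above can be made concrete with a short sketch in Python. The interval-based rule encoding and the accuracy-based fitness measure below are illustrative assumptions, not the exact chromosome and fitness function used in the paper:

```python
def matches(rule, example):
    """rule: list of (low, high) interval bounds, one per attribute.
    The rule is a conjunction: every attribute value must fall in its interval."""
    return all(low <= e <= high for (low, high), e in zip(rule, example))

def fitness(rule, training_set):
    """Fraction of examples classified correctly, treating a match as a
    positive prediction (c = 1) and a non-match as negative (c = 0)."""
    correct = sum(matches(rule, E) == bool(c) for E, c in training_set)
    return correct / len(training_set)

# Toy training set ψ with k = 2 attributes and binary class labels c.
examples = [([0.2, 0.7], 1), ([0.9, 0.1], 0), ([0.3, 0.6], 1)]
rule = [(0.0, 0.5), (0.5, 1.0)]   # 0 <= x1 <= 0.5 AND 0.5 <= x2 <= 1.0
print(fitness(rule, examples))    # → 1.0
```

A GA would evolve the interval bounds themselves, using this fitness to drive selection toward rules that separate positive from negative examples.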

Genetic algorithm with feature selection as a filter (GAFS)

The GA presented above is modified as follows to incorporate the self-adaptive feature selection method. Each population member contains, in addition to the disjunct, a binary vector for feature selection that evolves alongside the disjunct. A feature is selected if the corresponding bit in the selection vector is 1. For example, if we have five attributes in the original feature set, a typical rule (disjunct) represented in the new approach would pair each attribute X1, …, X5 with a value range such as [0, 0.1] and a selection bit
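A minimal sketch of this encoding, assuming an interval-based disjunct and treating unselected attributes as "don't care" conditions (the field layout and five-attribute example are illustrative):

```python
def matches(ranges, mask, example):
    """Conjunctive rule with a feature-selection mask: a condition is only
    enforced when the corresponding selection bit is 1; attributes with
    bit 0 are ignored ("don't care")."""
    return all(low <= e <= high
               for (low, high), bit, e in zip(ranges, mask, example)
               if bit == 1)

# Five attributes X1..X5; the evolved selection vector keeps only X1 and X3.
ranges = [(0.0, 0.1), (0.0, 1.0), (0.4, 0.6), (0.0, 1.0), (0.0, 1.0)]
mask   = [1, 0, 1, 0, 0]
print(matches(ranges, mask, [0.05, 0.9, 0.5, 0.2, 0.7]))  # → True
print(matches(ranges, mask, [0.5, 0.9, 0.5, 0.2, 0.7]))   # → False (X1 out of range)
```

Because the mask is part of the chromosome, crossover and mutation act on it just as they act on the interval bounds, so feature selection co-evolves with rule discovery instead of being a separate preprocessing step.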

Hausdorff distance

Although the Hausdorff distance measure is widely used in computer vision and graphics applications due to its excellent discriminant properties, it has not received much attention in the feature selection literature. The Hausdorff distance [24] is a measure of the similarity, with respect to their positions in a metric space, of two non-empty compact sets A and B. It measures the extent to which each point in a set is located relative to those in another set. Let X1 = {x11, x12, …, x1m} and X2 = {x21, x22, …, x2n}
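For finite point sets, the standard definition reduces to a max–min computation: H(A, B) = max(h(A, B), h(B, A)), where the directed distance is h(A, B) = max over a in A of min over b in B of d(a, b). A direct (O(mn)) sketch in Python:

```python
import math

def directed_hausdorff(A, B):
    """h(A, B): the largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance H(A, B) = max(h(A, B), h(B, A))."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (0.0, 3.0)]
print(hausdorff(A, B))  # → 3.0
```

Note that the measure is asymmetric in its directed form: here h(A, B) = 1.0 while h(B, A) = 3.0, because the point (0, 3) in B has no nearby counterpart in A.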

Problem domain

The problem is related to the control of a chemical process, with about 30 process variables, at an Eastman Kodak facility in Longview, TX, that produces a certain chemical product. The process also yields an undesirable byproduct that is not measured directly. To remove this byproduct, an expensive chemical is added in just sufficient quantities. The problem is to change the controllable process variables (9 of the 30 variables) so that the usage of the expensive

Conclusions

Data mining algorithms perform well when the concepts to be learned map to a space that is not complex. However, most real world data mining applications do not fall into this category, so algorithms must perform well even in the presence of noise, irrelevant features, and several local optima, among other complications. Evolutionary algorithms are excellent at performing global search, especially when the search landscape is complex, with multiple global as well as local optima. They have been shown to be

References (31)

  • A. Ahmed, M. Deriche, An optimal feature selection technique using the concept of mutual information, in: ...
  • H. Alt, B. Behrends, J. Bloemer, Approximate matching of polygonal shapes (extended abstract), in: Proceedings of the ...
  • P.J. Angeline, Adaptive and self-adaptive evolutionary computations
  • C. Babcock, Parallel Processing Mines Retail Data, Computer World, 6 September ...
  • R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks (1994)
  • J.M. Benitez, J.L. Castro, C.J. Mantas, F. Rojas, A neuro-fuzzy approach for feature selection, in: Proceedings of IFSA ...
  • P.S. Bradley et al., Feature selection in mathematical programming, INFORMS Journal on Computing (1998)
  • B. Chambless, D. Scarborough, Information theoretic feature selection for a neural behavioral model, in: ...
  • F.M. Coetzee, E. Glover, S. Lawrence, C.L. Giles, Feature selection in web applications by ROC inflections ...
  • T.M. Cover, The best two independent measurements are not the two best, IEEE Transactions on Systems, Man, and Cybernetics (1974)
  • A. Csaszar, General Topology (1978)
  • S. Dzeroski, Inductive logic programming and knowledge discovery in databases
  • J.D. Elashoff et al., On the choice of variables in classification problems with dichotomous variables, Biometrika (1967)
  • J. Elder et al., A statistical perspective on knowledge discovery in databases
  • T. Elomaa, E. Ukkonen, A geometric approach to feature selection, in: Proceedings of the European Conference on Machine ...