Computing, Artificial Intelligence and Information Management
Framework for efficient feature selection in genetic algorithm based data mining

https://doi.org/10.1016/j.ejor.2006.02.040

Abstract

We present the design of more effective and efficient genetic algorithm based data mining techniques that use the concepts of feature selection. Explicit feature selection is traditionally done as a wrapper approach, in which every candidate feature subset is evaluated by executing the data mining algorithm on that subset. In this article we present a GA that performs the tasks of mining and feature selection simultaneously by evolving a binary code alongside the chromosome structure used for evolving the rules. We then present a wrapper approach to feature selection based on the Hausdorff distance measure. Results from applying the above techniques to a real world data mining problem show that combining both feature selection methods provides the best performance in terms of prediction accuracy and computational efficiency.

Introduction

The ubiquity of databases in almost every area of human endeavor has resulted in a rapid increase in the amount of data collected and stored by industry, government, and scientific organizations. Data in these databases vary in format and content, ranging from trillions of point-of-sale transactions and credit card purchases to pixel images of distant galaxies. It is not uncommon to come across databases measured in terabytes. For example, Wal-Mart, the chain of over 2000 retail stores, uploads over 20 million point-of-sale transactions every day to an AT&T massively parallel system with more than 1000 processors running a centralized database [4]. Wal-Mart operates a data warehouse with over 583 terabytes of sales and inventory data. Although Moore’s law was not meant to describe the growth of databases, the amount of data stored by businesses nearly doubles every 12–18 months. These massive collections of data have created opportunities to monitor, analyze, and predict processes of interest. In today’s fiercely competitive business environment, firms need to rapidly turn these terabytes of data into significant insights into their customers, markets, products, and processes to guide their marketing, investment, and management strategies and gain competitive advantage.

Although these data contain buried valuable information, raw data by themselves provide little insight. Raw data must first be processed to extract patterns or useful knowledge. To this end, the development of effective and efficient methods for deriving knowledge from these data is becoming increasingly important [16]. Recent efforts under the general rubric of data mining represent a strategic response to this growing need for analytic tools that deliver in situations where traditional analytical methods fail. This is especially evident when dealing with unstructured data that are noisy and incomplete, and where scalability of traditional methods is an issue. By discovering hidden, implicit, previously unknown, and potentially useful patterns and relationships in the data, data mining enables users to extract greater value from their data than simple query and analysis approaches.

Some commonly used data mining algorithms fall under the following categories: decision trees and rules [26], nonlinear regression and classification methods [14], example-based methods [18], probabilistic graphical dependency models [30], and relational learning models [12]. Over the years, genetic algorithms have been successfully applied to learning tasks in different domains, such as chemical process control [27], financial classification [28], and manufacturing scheduling [22], among others.

In this paper we propose frameworks for data mining using genetic algorithms, implement them, and evaluate their performance using examples. We chose the genetic algorithm for its simplicity and its capability as a powerful search mechanism. We present the design of more effective and efficient genetic algorithm based data mining techniques that use self-adaptive feature selection together with a wrapper feature selection method based on the Hausdorff distance measure. A genetic algorithm [17] uses a population of individual solution structures called chromosomes. The fitness of an individual solution is its performance measure. This measure is used to favor the selection of successful parents for new offspring, so that the whole population of solutions incrementally evolves towards greater fitness. Offspring solutions are produced from parent solutions by the application of crossover and mutation operators. Theory shows that knowledge about desirable solutions is advantageously stored in the population itself, implicitly contained in the surviving chromosomes. We take advantage of this principle in developing modified frameworks with the genetic algorithm at their core.
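The selection–crossover–mutation cycle described above can be sketched in a few lines of Python. This is a minimal illustration under assumed parameters (bit-string chromosomes, roulette-wheel selection, one-point crossover, bit-flip mutation), not the implementation used in this paper:

```python
import random

def evolve(fitness, n_bits=10, pop_size=20, generations=50,
           crossover_rate=0.9, mutation_rate=0.01, seed=0):
    """Minimal GA sketch: evolve bit-string chromosomes toward higher fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        total = sum(scores) or 1.0

        def pick():  # fitness-proportional (roulette-wheel) parent selection
            r, acc = rng.random() * total, 0.0
            for chrom, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return chrom
            return pop[-1]

        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < crossover_rate:      # one-point crossover
                cut = rng.randrange(1, n_bits)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # bit-flip mutation with a small per-bit probability
            child = [b ^ (rng.random() < mutation_rate) for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# "One-max" toy problem: fitness is simply the number of 1-bits.
best = evolve(sum)
```

On the one-max problem the population rapidly converges toward the all-ones chromosome, illustrating how fitness-proportional selection concentrates the population on high-fitness regions of the search space.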

The rest of this paper is organized as follows. We present a brief discussion of feature selection in the next section. In Section 3, we discuss the genetic algorithm and its basic dynamics. We also include the first framework for the basic genetic algorithm case and show how we implement and evaluate this methodology. This is followed by the presentation of a modified filter-based genetic algorithm with embedded feature selection in Section 4. In Section 5, we provide a brief discussion of the Hausdorff distance measure, and then present the proposed framework of a genetic algorithm with the Hausdorff distance as wrapper-based feature selection. We discuss the effectiveness of feature selection in Section 6, including experimental results using a real world data set. Section 7 concludes this paper.


Feature selection

A typical real world data set consists of as many features as are deemed necessary. This is constrained by (1) knowledge of the domain of interest, and in turn knowledge of the essential features that capture knowledge in this domain, (2) availability of these essential features, (3) resources available to gather these available essential features, and (4) resources available to store, maintain, and retrieve the collected features. Given these constraints, it is clear that not all features that end

Genetic algorithm based data mining

In this section we present the design of a genetic algorithm for rule learning in a data mining application. Assume that the data mining problem has k attributes and we have a set of training examples ψ = {(E_t, c) : t = 1, …, T}, where T is the total number of examples, each example E_t is a vector of k attribute values E_t = [e_t1, e_t2, …, e_tk], and c is its classification value (usually a binary value indicating whether the example is positive or negative). The goal of data mining is to learn concepts that can
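The setting above can be made concrete with a short sketch in Python. The interval-based rule encoding and the accuracy-based fitness measure below are illustrative assumptions, not the exact chromosome and fitness function used in the paper:

```python
def matches(rule, example):
    """rule: list of (low, high) interval bounds, one per attribute.
    The rule is a conjunction: every attribute value must fall in its interval."""
    return all(low <= e <= high for (low, high), e in zip(rule, example))

def fitness(rule, training_set):
    """Fraction of examples classified correctly, treating a match as a
    positive prediction (c = 1) and a non-match as negative (c = 0)."""
    correct = sum(matches(rule, E) == bool(c) for E, c in training_set)
    return correct / len(training_set)

# Toy training set ψ with k = 2 attributes and binary class labels c.
examples = [([0.2, 0.7], 1), ([0.9, 0.1], 0), ([0.3, 0.6], 1)]
rule = [(0.0, 0.5), (0.5, 1.0)]   # 0 <= x1 <= 0.5 AND 0.5 <= x2 <= 1.0
print(fitness(rule, examples))    # → 1.0
```

A GA would evolve the interval bounds themselves, using this fitness to drive selection toward rules that separate positive from negative examples.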

Genetic algorithm with feature selection as a filter (GAFS)

The GA presented above is modified as follows to incorporate the self-adaptive feature selection method. Each population member contains, in addition to the disjunct, a binary vector for feature selection that evolves alongside the disjunct. A feature is selected if the corresponding bit in the selection vector is 1. For example, if we have five attributes in the original feature set, a typical rule (disjunct) represented in the new approach would pair each attribute X1, …, X5 with a value range such as [0, 0.1] and a selection bit
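A minimal sketch of this encoding, assuming an interval-based disjunct and treating unselected attributes as "don't care" conditions (the field layout and five-attribute example are illustrative):

```python
def matches(ranges, mask, example):
    """Conjunctive rule with a feature-selection mask: a condition is only
    enforced when the corresponding selection bit is 1; attributes with
    bit 0 are ignored ("don't care")."""
    return all(low <= e <= high
               for (low, high), bit, e in zip(ranges, mask, example)
               if bit == 1)

# Five attributes X1..X5; the evolved selection vector keeps only X1 and X3.
ranges = [(0.0, 0.1), (0.0, 1.0), (0.4, 0.6), (0.0, 1.0), (0.0, 1.0)]
mask   = [1, 0, 1, 0, 0]
print(matches(ranges, mask, [0.05, 0.9, 0.5, 0.2, 0.7]))  # → True
print(matches(ranges, mask, [0.5, 0.9, 0.5, 0.2, 0.7]))   # → False (X1 out of range)
```

Because the mask is part of the chromosome, crossover and mutation act on it just as they act on the interval bounds, so feature selection co-evolves with rule discovery instead of being a separate preprocessing step.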

Hausdorff distance

Although the Hausdorff distance measure is widely used in computer vision and graphics applications due to its excellent discriminant properties, it has not received much attention in the feature selection literature. The Hausdorff distance [24] is a measure of the similarity, with respect to their positions in a metric space, of two non-empty compact sets A and B. It measures the extent to which each point in a set is located relative to those in another set. Let X1 = {x11, x12, …, x1m} and X2 = {x21, x22, …, x2n}
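For finite point sets, the standard definition reduces to a max–min computation: H(A, B) = max(h(A, B), h(B, A)), where the directed distance is h(A, B) = max over a in A of min over b in B of d(a, b). A direct (O(mn)) sketch in Python:

```python
import math

def directed_hausdorff(A, B):
    """h(A, B): the largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance H(A, B) = max(h(A, B), h(B, A))."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (0.0, 3.0)]
print(hausdorff(A, B))  # → 3.0
```

Note that the measure is asymmetric in its directed form: here h(A, B) = 1.0 while h(B, A) = 3.0, because the point (0, 3) in B has no nearby counterpart in A.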

Problem domain

The problem is related to the control of a chemical process, with about 30 process variables, at an Eastman Kodak facility in Longview, TX, that produces a certain chemical product. The process also yields an undesirable byproduct that is not measured directly. To remove this byproduct, an expensive chemical is added in just sufficient quantities. The problem is to change the controllable process variables (9 of the 30 variables) so that the usage of the expensive

Conclusions

Data mining algorithms perform well when the concepts to be learned map to a space that is not complex. However, most real world data mining applications do not fall into this category, so algorithms must perform well even in the presence of noise, irrelevant features, and several local optima, among other complications. Evolutionary algorithms are excellent at performing global search, especially when the search landscape is complex, with multiple global as well as local optima. They have been shown to be

References (31)

  • A. Ahmed, M. Deriche, An optimal feature selection technique using the concept of mutual information, in: ...
  • H. Alt, B. Behrends, J. Bloemer, Approximate matching of polygonal shapes (extended abstract), in: Proceedings of the ...
  • P.J. Angeline, Adaptive and self-adaptive evolutionary computations
  • C. Babcock, Parallel Processing Mines Retail Data, Computer World, 6 September ...
  • R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks (1994)
  • J.M. Benitez, J.L. Castro, C.J. Mantas, F. Rojas, A neuro-fuzzy approach for feature selection, in: Proceedings of IFSA ...
  • P.S. Bradley et al., Feature selection in mathematical programming, INFORMS Journal on Computing (1998)
  • B. Chambless, D. Scarborough, Information theoretic feature selection for a neural behavioral model, in: ...
  • F.M. Coetzee, E. Glover, S. Lawrence, C.L. Giles, Feature selection in web applications by ROC inflections ...
  • T.M. Cover, The best two independent measurements are not the two best, IEEE Transactions on Systems, Man, and Cybernetics (1974)
  • A. Csaszar, General Topology (1978)
  • S. Dzeroski, Inductive logic programming and knowledge discovery in databases
  • J.D. Elashoff et al., On the choice of variables in classification problems with dichotomous variables, Biometrika (1967)
  • J. Elder et al., A statistical perspective on knowledge discovery in databases
  • T. Elomaa, E. Ukkonen, A geometric approach to feature selection, in: Proceedings of the European Conference on Machine ...