MMDT: a multi-valued and multi-labeled decision tree classifier for data mining

https://doi.org/10.1016/j.eswa.2004.12.035

Abstract

We previously proposed a decision tree classifier named MMC (multi-valued and multi-labeled classifier). MMC is known for its capability of classifying large multi-valued and multi-labeled data sets. Aiming to improve the accuracy of MMC, this paper develops another classifier named MMDT (multi-valued and multi-labeled decision tree). MMDT differs from MMC mainly in attribute selection. MMC attempts to split a node into child nodes whose records approach the same multiple labels; it basically measures the average similarity of the labels in each child node to determine the goodness of each splitting attribute. MMDT, in contrast, uses a measuring strategy that considers not only the average similarity of the labels in each child node but also their average appropriateness. The new measuring strategy takes a scoring approach to obtain a look-ahead measure of each splitting attribute's contribution to accuracy. The experimental results show that MMDT improves on the accuracy of MMC.

Introduction

The purpose of a decision tree classifier is to classify instances based on the values of ordinary attributes and a class label attribute. Traditionally, the data set is single-valued and single-labeled: each record has many single-valued attributes and a single-labeled class attribute, and the class labels, of which there may be two or more, are mutually exclusive. Prior decision tree classifiers, such as ID3 (Quinlan, 1979, Quinlan, 1986), the Distance-based method (Mantaras, 1991), IC (Agrawal, Ghosh, Imielinski, Iyer, & Swami, 1992), C4.5 (Quinlan, 1993), Fuzzy ID3 (Umano et al., 1994), CART (Steinberg & Colla, 1995), SLIQ (Mehta, Agrawal, & Rissanen, 1996), SPRINT (Shafer, Agrawal, & Mehta, 1996), Rainforest (Gehrke, Ramakrishnan, & Ganti, 1998) and PUBLIC (Rastogi & Shim, 1998), all focus on such single-valued and single-labeled data sets.

However, multi-valued and multi-labeled data exists in the real world, as shown in Table 1. Multi-valued data means that a record can have multiple values for an ordinary attribute. Multi-labeled data means that a record can belong to multiple class labels, and the class labels are not mutually exclusive. Readers might find it difficult to distinguish multi-labeled data from the two-classed or multi-classed data mentioned in some related works. To clarify this confusion, we discuss the exclusiveness among classes, the number of classes, and the representation of the class label attribute in the related works as follows:

  • 1. Exclusiveness: Each record can belong to only a single class; classes are mutually exclusive. ID3, the Distance-based method, IC, C4.5, Fuzzy ID3, CART, SLIQ, SPRINT, Rainforest and PUBLIC are such examples.

  • 2. Number of classes: Data whose class label attribute takes one of two types is called two-classed data; ID3 and C4.5 are such examples. Data whose class label attribute takes one of more than two types is called multi-classed data; IC, CART and Fuzzy ID3 are such examples.

  • 3. Label representation: Data with a single value for the class label attribute is called single-labeled data. ID3, the Distance-based method, IC, C4.5, Fuzzy ID3, CART, SLIQ, SPRINT, Rainforest and PUBLIC are such examples.

According to the discussion above, the multi-valued and multi-labeled data defined here can be regarded as non-exclusive, multi-classed and multi-labeled data.

In our previous work (Chen, Hsu, & Chou, 2003), we explained why traditional classifiers are not capable of handling multi-valued and multi-labeled data. To solve this classification problem, we designed a decision tree classifier named MMC (Chen et al., 2003). MMC differs from the traditional classifiers in several major functions, including growing a decision tree, assigning labels to represent a leaf, and making a prediction for a new record. In growing a tree, MMC proposes a new measure named weighted similarity for selecting a multi-valued attribute to partition a node into child nodes that approach a perfect grouping. To assign labels, MMC picks those whose counts are large enough to represent a leaf. To make a prediction for a new record, MMC traverses the tree as usual; when the traversal reaches several leaf nodes because of a multi-valued attribute, MMC unions all the labels of those leaf nodes as the prediction result. Experimental results show that MMC achieves an average predictive accuracy of 62.56%.
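To make the prediction step concrete, the following is a minimal Python sketch of MMC-style prediction as described above. The Node structure and field names are illustrative assumptions, not the paper's implementation; the point is only that a multi-valued record can reach several leaves, whose labels are then unioned.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    attribute: Optional[str] = None  # splitting attribute; None marks a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)  # value -> child
    labels: Set[str] = field(default_factory=set)  # labels assigned to a leaf


def predict(node: Node, record: Dict[str, Set[str]]) -> Set[str]:
    """Union the labels of every leaf reached by a multi-valued record."""
    if node.attribute is None:  # leaf: contribute its labels
        return set(node.labels)
    result: Set[str] = set()
    # A multi-valued attribute sends the record down several branches at once,
    # so the traversal may reach several leaves.
    for value in record.get(node.attribute, set()):
        child = node.children.get(value)
        if child is not None:
            result |= predict(child, record)
    return result
```

For example, a record with A1 = {x, y} follows both the x and y branches, and its prediction is the union of the labels at all leaves it reaches.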

Having developed a decision tree classifier for multi-valued and multi-labeled data, this research takes a further step to improve the classifier's accuracy. Considering the following over-fitting problems (Han and Kamber, 2001, Russell and Norvig, 1995) of MMC, improvement in its predictive accuracy seems possible. First, MMC neglects to avoid the situation where the data set in a node is too small; it may therefore choose attributes irrelevant to the class labels. Second, MMC prefers attributes that split into child nodes with larger similarity among multiple labels; MMC therefore exhibits inductive bias (Gordon & Desjardins, 1995).

To minimize the over-fitting problems above, this paper proposes the following solutions: (1) set a minimum size constraint on the data set in each node to avoid the data set becoming too small; (2) consider not only the average similarity of the labels in each child node but also their average appropriateness, to reduce MMC's bias problem. A brief sketch of both ideas follows.
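The sketch below illustrates the two proposals under stated assumptions; the paper's actual formulas are in the full text. The MIN_SIZE value, the Jaccard ratio standing in for label similarity, the frequency-based appropriateness, and the equal weighting of the two terms are all illustrative assumptions.

```python
from itertools import combinations
from typing import List, Set

MIN_SIZE = 5  # assumed threshold; the paper only states that a size constraint is set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Assumed similarity between two label sets (the paper defines its own)."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def avg_similarity(label_sets: List[Set[str]]) -> float:
    """Average pairwise similarity among the label sets in one child node."""
    pairs = list(combinations(label_sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def avg_appropriateness(label_sets: List[Set[str]]) -> float:
    """Assumed appropriateness: mean support of the labels occurring in the node."""
    counts: dict = {}
    for ls in label_sets:
        for label in ls:
            counts[label] = counts.get(label, 0) + 1
    if not counts:
        return 0.0
    return sum(counts.values()) / (len(counts) * len(label_sets))


def goodness(children: List[List[Set[str]]]) -> float:
    """Score one candidate split; reject splits that leave any child too small."""
    if any(len(child) < MIN_SIZE for child in children):
        return float("-inf")
    scores = [0.5 * avg_similarity(c) + 0.5 * avg_appropriateness(c)  # equal weights assumed
              for c in children]
    return sum(scores) / len(scores)
```

The attribute whose split yields the highest goodness would be chosen, with the minimum-size constraint pruning candidate splits that fragment the data too finely.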

Based on these propositions, we have designed a new decision tree classifier to improve the accuracy of MMC. The classifier, named MMDT (multi-valued and multi-labeled decision tree), constructs a multi-valued and multi-labeled decision tree as Fig. 1 shows.

The rest of the paper is organized as follows. Section 2 introduces the notation. Section 3 describes the tree construction and data prediction algorithms. Section 4 presents the experiments. Finally, Section 5 gives the summary and conclusions.


Notation

The symbols for the multi-valued and multi-labeled classification problem are formally stated as follows (a small data-structure sketch follows the list):

  • (a) Given a training set D, |D| denotes the number of records in D.

  • (b) C denotes the set of class labels, C = {Ci | Ci is a class label, i = 1, …, k}. The number of class labels in C is known in advance; |C| denotes the number of class labels, k.

  • (c) A denotes the set of attributes, A = {Ai | Ai is any ordinary attribute of D, i = 1, …, n}. |A| denotes the number of attributes, n.

  • (d) ‘Multiple labels’ is represented as …
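To ground this notation, here is a hypothetical Python encoding of D, C, and A; the Record class and its field names are illustrative assumptions, not the paper's representation.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Record:
    attributes: Dict[str, Set[str]]  # Ai -> set of values (multi-valued)
    labels: Set[str]                 # subset of C (multi-labeled)


# A toy training set with |D| = 2, |C| = 3 and |A| = 2.
D = [
    Record({"A1": {"x", "y"}, "A2": {"p"}}, {"c1", "c2"}),
    Record({"A1": {"y"}, "A2": {"q"}}, {"c3"}),
]
C = {"c1", "c2", "c3"}
A = {"A1", "A2"}
assert len(D) == 2 and len(C) == 3 and len(A) == 2
```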

The algorithms

The development of MMDT is based on MMC (Chen et al., 2003). For completeness, we briefly review the relevant parts of the MMC algorithm in Section 3.1. Then we present the MMDT algorithm in Section 3.2. Finally, we describe how MMDT makes a prediction and how the accuracy of the prediction is evaluated in Section 3.3; a hedged sketch of one such accuracy measure follows.
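The full text defines the paper's own accuracy measure. As an illustration only, multi-label predictions are commonly scored by the average overlap between predicted and actual label sets; the Jaccard-style overlap below is an assumed stand-in, not the paper's formula.

```python
from typing import List, Set


def label_set_accuracy(predicted: List[Set[str]], actual: List[Set[str]]) -> float:
    """Mean per-record overlap between predicted and actual label sets."""
    def overlap(p: Set[str], a: Set[str]) -> float:
        return len(p & a) / len(p | a) if (p | a) else 1.0
    return sum(overlap(p, a) for p, a in zip(predicted, actual)) / len(actual)


# e.g. predicting {c1, c2} when the actual labels are {c1} scores 0.5.
```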

Experiments

Han and Kamber stated that classification and prediction methods can be compared and evaluated according to six criteria (Han, 2000, Han and Kamber, 2001): predictive accuracy, speed, robustness, scalability, interpretability and goodness of rules. This paper focuses on predictive accuracy and goodness of rules. We describe the experimental design in Section 4.1, then present the experimental results in Section 4.2.

Summary and conclusion

This research has designed a decision tree classifier, MMDT, to improve the accuracy of MMC by minimizing its over-fitting problems. In MMDT, we set a minimum size constraint on the data set in each node to avoid the data set becoming too small, and we consider not only the average similarity of the labels in each child node but also their average appropriateness, to reduce the bias problem. The experimental results show that MMDT improves on the accuracy of MMC.


References (21)

  • Chen, Y., Hsu, C., & Chou, S. (2003). Constructing a multi-valued and multi-labeled decision tree. Expert Systems with Applications.
  • Han, J., et al. (1998). Generalization-based data mining in object-oriented databases using an object-cube model. Data and Knowledge Engineering.
  • Agrawal, R., Ghosh, S., Imielinski, T., Iyer, B., & Swami, A. (1992). An interval classifier for database mining...
  • Agrawal, R., et al. (1993). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering.
  • Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). Rainforest: A framework for fast decision tree construction of large...
  • Gordon, D. F., & Desjardins, M. (1995). Evaluation and selection of biases in machine learning. Machine Learning.
  • Han, J. (2000). From data mining to web mining: An overview. Conference tutorial (in PowerPoint), 2000 international...
  • Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques.
  • Mantaras, R. L. D. (1991). A distance-based attribute selection measure for decision tree induction. Machine Learning.
  • Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. Proceedings of the...
