Elsevier

Pattern Recognition

Volume 44, Issue 2, February 2011, Pages 330-349
Mining data with random forests: A survey and results of new tests

https://doi.org/10.1016/j.patcog.2010.08.011

Abstract

Random forests (RF) has become a popular technique for classification, prediction, studying variable importance, variable selection, and outlier detection. There are numerous application examples of RF in a variety of fields, and several large-scale comparisons including RF have been performed. In many articles, variable rankings based on the variable importance measures available from RF are used for data exploration and understanding. Apart from surveying the RF literature, this paper presents results of new tests regarding variable rankings based on RF variable importance measures. We studied experimentally the consistency and generality of such rankings. The results provide no evidence supporting the belief that such rankings are general, and a high variance of variable importance evaluations was observed for small numbers of trees and small data sets.

Introduction

The growing size of data sets increases the variety of problems characterized by a large number of variables. Nowadays, it is not uncommon for the number of variables N to exceed the number of observations M. Microarray gene expression data is a characteristic example, where most often N ≫ M. Traditional statistical techniques run into problems when N > M, so machine learning-based techniques are usually applied in such cases.

Support vector machines (SVM) [1], [2], the multilayer perceptron (MLP) [3], the relevance vector machine (RVM) [4], and various ensembling approaches [5] are probably the most popular machine learning techniques applied to create predictors. SVM and RVM make no assumptions about the data, are able to find the global minimum of the objective function, and can provide near-optimal performance. Moreover, the complexity of these techniques depends on the number of support (relevance) vectors, not on the dimensionality of the input space. However, predictors based on these techniques provide little insight into the importance of individual variables to the derived predictor. Such transparency is very important in some application areas, for example medical decision support or quality control.

By contrast, classification and regression trees [6], [7] are known for their transparency. However, decision trees are rather sensitive to small perturbations in the learning set. It has been demonstrated that this problem can be mitigated by applying bagging [8], [9]. Random forests, proposed by Breiman [10] and studied by Biau et al. [11], combine the random subspace method proposed by Ho [12] with bagging. RF have been used for a large variety of tasks, including: identification of DNA-binding proteins [13], segmentation of video objects [14], classification of hyper-spectral data [15], [16], prediction of vegetation type occurrence in Belgian lowland valleys based on spatially distributed measurements of environmental conditions [17], [18], prediction of distributions of bird and mammal species characteristic of the eastern slopes of the central Andes [19], Czech language modeling in a lecture recognition task [20], diagnosing Alzheimer's disease based on single photon emission computed tomography (SPECT) data [21], identification of genetic polymorphisms [22], prediction of long disordered regions in protein sequences [23], classification of agricultural practices based on Landsat satellite imagery [24], classification of aerial images [25], analysis of phenolic antioxidants in chemistry [26], recognition of handwritten digits [27], categorizing time-depth profiles of diving vertebrates [28], and many others.

Weak learners and random forests

Let us assume that we are given a set of training data Xt = {(xm, ym), m = 1, …, M}, where xm is an input observation and ym is a predictor output. A weak learner can be created using the training set Xt. A weak learner is a predictor f(x, Xt) having a low bias and a high variance [30]. By randomly sampling from the set Xt, a collection of weak learners f(x, Xt, θk) can be created, with f(x, Xt, θk) being the kth weak learner and θk being the random vector selecting data points for the kth weak learner. By
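The bootstrap-aggregation step described above can be sketched in a few lines. This is a generic illustration, not the paper's experimental code: it assumes a decision tree as the weak learner (low bias, high variance) and uses synthetic data in place of a real training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
M = len(y)

# Each theta_k is a bootstrap sample of indices: it selects the data
# points used to fit the k-th weak learner f(x, Xt, theta_k).
learners = []
for k in range(25):
    theta_k = rng.integers(0, M, size=M)           # M draws with replacement
    tree = DecisionTreeClassifier(random_state=k)  # unpruned tree: low bias, high variance
    tree.fit(X[theta_k], y[theta_k])
    learners.append(tree)

# Aggregate the collection of weak learners by majority vote.
votes = np.stack([t.predict(X) for t in learners])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
```

Averaging over many such bootstrapped trees keeps the low bias of each tree while reducing the variance of the ensemble, which is the point of bagging.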

Objectives of the study

Several researchers have found that RF error rates compare favorably to those of other predictors, including logistic regression (LR), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-NN, MLP, SVM, classification and regression trees (CART), naive Bayes [10], [31], [32], [33], and other techniques. However, counter-examples can also easily be found [34], see Table 2. This is expected, since according to the No Free Lunch theorem [35], there is no single classifier model which

Large scale studies including random forests

Meyer et al. investigated SVM performance and carried out a large-scale comparison of 16 classification and nine regression techniques [38]. RF, SVM, MLP, and bagged ensembles of trees were among the techniques compared. Hyper-parameters of RF, SVM, and MLP were carefully selected. The classification benchmark was run on 21 data sets, while 12 data sets were used for the regression tests. Most of the data sets come from the UCI machine learning repository. The performance was assessed using 10

Accuracy related studies

There is a large number of small-scale studies aiming to compare the performance of several techniques on one or very few data sets. A large variety of application areas is explored, and the rigor of the comparisons varies greatly between studies. We summarize the results of these studies in two tables. Table 1 presents a survey of studies where random forests outperformed or performed on the same level as other techniques. Table 2 summarizes studies where random forests were outperformed by other

New tests regarding variable importance measures

We used four public databases for these tests: Waveform, Satimage, Thyroid, and Wisconsin Diagnostic Breast Cancer—WDBC (http://archive.ics.uci.edu/ml/).

There are three classes of waves in the Waveform database [6], with an equal number of instances in each class; each class is generated from a combination of two of three "base" waves. All 40 variables include noise with mean 0 and variance 1, and the latter 19 variables are pure noise with mean 0 and variance 1. The Satimage data set
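A ranking-consistency experiment of the kind studied here can be sketched roughly as follows. The sketch uses a synthetic stand-in with 21 informative and 19 pure-noise variables (mimicking the Waveform structure) rather than the UCI data itself, and the protocol is illustrative, not the authors' exact one: two forests trained with different random seeds give two importance-based rankings, whose agreement can be compared via top-k overlap.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for Waveform: features 0-20 are informative,
# features 21-39 are pure noise (shuffle=False keeps them in order).
X, y = make_classification(n_samples=500, n_features=40, n_informative=21,
                           n_redundant=0, shuffle=False, random_state=0)

def ranking(n_trees, seed):
    """Variable ranking (most important first) from RF importance scores."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X, y)
    return np.argsort(-rf.feature_importances_)

def overlap(a, b, k=10):
    """How many of the top-k variables two rankings share."""
    return len(set(a[:k]) & set(b[:k]))

# Rankings from small forests tend to disagree more between runs than
# rankings from large forests -- the variance effect reported in the paper.
small_a, small_b = ranking(10, 1), ranking(10, 2)
large_a, large_b = ranking(500, 1), ranking(500, 2)
```

Repeating this over many seeds and forest sizes gives a simple empirical picture of how stable the importance-based rankings are.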

Tests concerning problem complexity

The aim of these studies is to gain some insight into the "suitability" of the problem at hand for RF-based classification. To assess problem complexity, several measures studied in [36], [37] are used in this work. The measures are listed in Table 4.

The measures F1, F2, F3, and F4 reflect the degree of overlap of individual feature values, while the measures L1 and L2 assess the linear separability of classes. To compute the values of the measures L1, L2, and L3, a linear classifier is built. An SVM
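As an illustration, the first of these overlap measures, F1, is commonly defined for a two-class problem as the maximum over features of Fisher's discriminant ratio. The sketch below follows that standard definition on hypothetical synthetic data; it is not the paper's implementation.

```python
import numpy as np

def fisher_ratio_f1(X, y):
    """Measure F1 (two-class case): the maximum over features of
    (mu1 - mu2)^2 / (var1 + var2), i.e. Fisher's discriminant ratio."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0)
    return float(np.max(num / den))

# Hypothetical data: well-separated classes give a large F1,
# heavily overlapping classes give a small F1.
rng = np.random.default_rng(0)
a = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(5, 1, (100, 3))])
b = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(0.5, 1, (100, 3))])
labels = np.array([0] * 100 + [1] * 100)
```

A large F1 indicates that at least one feature separates the classes well, which in turn suggests the classification problem is comparatively easy.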

Discussion

Fast training, the possibility of obtaining a generalization error estimate without splitting the data set into learning and validation subsets, variable importance evaluations available as a byproduct of training, and the fact that only one parameter needs to be tuned experimentally make RF a very popular data mining technique.
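The generalization error estimate mentioned above is the out-of-bag (OOB) estimate: each tree is tested on the training points left out of its bootstrap sample. A minimal sketch with scikit-learn, on illustrative synthetic data rather than the paper's benchmarks:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# oob_score=True estimates generalization accuracy from the out-of-bag
# samples, so no separate validation split is needed.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob_error = 1.0 - rf.oob_score_
```

With enough trees, every training point is out-of-bag for some trees, and the OOB error is in practice close to what a held-out validation set would report.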

Acknowledgements

Useful suggestions from the referees are gratefully acknowledged. We acknowledge the support from the Agency for International Science and Technology Development Programmes in Lithuania (COST Actions IC0602 and IC0806). The infrastructure for parallel and distributed computing and e-services (LitGrid) was used in the studies.

References (112)

  • K. Coussement et al., Improving customer attrition prediction by integrating emotions from client/company interaction emails and evaluating multiple classifiers, Expert Systems with Applications (2009)
  • D. Perdiguero-Alonso et al., Random forests, a novel approach for discrimination of fish populations using parasites as biological tags, International Journal for Parasitology (2008)
  • I. Koprinska et al., Learning to classify e-mail, Information Sciences (2007)
  • B. Slabbinck et al., Towards large-scale FAME-based bacterial species identification using machine learning techniques, Systematic and Applied Microbiology (2009)
  • G. Zhang et al., Discriminating acidic and alkaline enzymes using a random forest model with secondary structure amino acid composition, Process Biochemistry (2009)
  • M.B. Garzon et al., Predicting habitat suitability with machine learning models: the potential area of Pinus sylvestris L. in the Iberian peninsula, Ecological Modelling (2006)
  • S. Tognazzo et al., Probabilistic classifiers and automated cancer registration: an exploratory application, Journal of Biomedical Informatics (2009)
  • D. Kocev et al., Using single- and multi-target regression trees and ensembles to model a compound index of vegetation condition, Ecological Modelling (2009)
  • W. Buckinx et al., Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting, European Journal of Operational Research (2005)
  • P.M. Granitto et al., Rapid and non-destructive identification of strawberry cultivars by direct PTR-MS headspace analysis and data mining techniques, Sensors and Actuators B (2007)
  • D. Donald et al., Adaptive wavelet modelling of a nested 3 factor experimental design in NIR chemometrics, Chemometrics and Intelligent Laboratory Systems (2006)
  • L.I. Kuncheva et al., Diagnosing scrapie in sheep: a classification experiment, Computers in Biology and Medicine (2007)
  • A. Brenning, Benchmarking classifiers to optimally integrate terrain analysis and multispectral remote sensing in automatic rock glacier detection, Remote Sensing of Environment (2009)
  • W. Buckinx et al., Predicting customer loyalty using the internal transactional database, Expert Systems with Applications (2007)
  • T. Hancock et al., A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies, Chemometrics and Intelligent Laboratory Systems (2005)
  • K.J. Archer et al., Empirical characterization of random forest variable importance measures, Computational Statistics & Data Analysis (2008)
  • R. Harb et al., Exploring precrash maneuvers using classification trees and random forests, Accident Analysis and Prevention (2009)
  • J. Fan et al., Multivariate exponential survival trees and their application to tooth prognosis, Computational Statistics & Data Analysis (2009)
  • J. Zhang et al., Random-forests-based network intrusion detection systems, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews (2008)
  • D. Donald et al., Bagged super wavelets reduction for boosted prostate cancer classification of SELDI-TOF mass spectral serum profiles, Chemometrics and Intelligent Laboratory Systems (2006)
  • B. Lariviere et al., Predicting customer retention and profitability by using random forests and regression forests techniques, Expert Systems with Applications (2005)
  • L.C. Keely et al., Understanding preferences for income redistribution, Journal of Public Economics (2008)
  • H. Wang et al., Hedged predictions for traditional Chinese chronic gastritis diagnosis with confidence machine, Computers in Biology and Medicine (2009)
  • P.O. Gislason et al., Random forests for land cover classification, Pattern Recognition Letters (2006)
  • V.N. Vapnik, Statistical Learning Theory (1998)
  • J. Shawe-Taylor et al., Kernel Methods for Pattern Analysis (2004)
  • C.M. Bishop, Pattern Recognition and Machine Learning (2006)
  • M.E. Tipping, Sparse Bayesian learning and the relevance vector machine, Journal of Machine Learning Research (2001)
  • L. Breiman et al., Classification and Regression Trees (1993)
  • I.H. Witten et al., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (1999)
  • L. Breiman, Bagging predictors, Machine Learning (1996)
  • T.G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Machine Learning (2000)
  • L. Breiman, Random forests, Machine Learning (2001)
  • G. Biau et al., Consistency of random forests and other averaging classifiers, Journal of Machine Learning Research (2008)
  • T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence (1998)
  • H.T. Chen et al., Segmenting highly articulated video objects with weak-prior random forests
  • J. Ham et al., Investigation of the random forest framework for classification of hyperspectral data, IEEE Transactions on Geoscience and Remote Sensing (2005)
  • M.M. Crawford et al., Random forests of binary hierarchical classifiers for analysis of hyperspectral data
  • P.A. Hernandez et al., Predicting species distributions in poorly-studied landscapes, Biodiversity and Conservation (2008)
  • I. Oparin et al., Morphological random forests for language modeling of inflectional languages

Antanas Verikas currently holds a professorship at both Halmstad University, Sweden, and Kaunas University of Technology, Lithuania. His research interests include image processing, pattern recognition, artificial neural networks, fuzzy logic, and visual media technology. He is a member of the International Pattern Recognition Society, the European Neural Network Society, the International Association of Science and Technology for Development, the Swedish Society of Learning Systems, and the IEEE.

Adas Gelzinis received the M.S. degree in Electrical Engineering from Kaunas University of Technology, Lithuania, in 1995, and the Ph.D. degree in Computer Science from the same university in 2000. He is a Senior Researcher in the Department of Electrical and Control Equipment at Kaunas University of Technology. His research interests include artificial neural networks, kernel methods, pattern recognition, signal and image processing, and texture classification.

Marija Bacauskiene is a Senior Researcher in the Department of Electrical and Control Equipment at Kaunas University of Technology, Lithuania. Her research interests include artificial neural networks, image processing, pattern recognition, and fuzzy logic. She has participated in various research projects and published numerous papers in these areas.
