Mining data with random forests: A survey and results of new tests
Introduction
The growing size of data sets increases the variety of problems characterized by a large number of variables. Nowadays, it is not uncommon that the number of variables N is larger than the number of observations M. Microarray gene expression data is a characteristic example, where most often N ≫ M. Traditional statistical techniques experience problems when N > M; therefore, machine learning-based techniques are usually applied in such cases.
Support vector machine (SVM) [1], [2], multilayer perceptron (MLP) [3], relevance vector machine (RVM) [4], and various ensemble approaches [5] are probably the most popular machine learning techniques applied to create predictors. SVM and RVM make no assumptions about the data, are able to find the global minimum of the objective function, and can provide near-optimal performance. Moreover, the complexity of these techniques depends on the number of support (relevance) vectors, not on the dimensionality of the input space. However, predictors based on these techniques provide little insight into the importance of individual variables to the derived predictor. Such transparency is very important in some application areas, such as medical decision support or quality control.
By contrast, classification and regression trees [6], [7] are known for their transparency. However, decision trees are rather sensitive to small perturbations in the learning set. It has been demonstrated that this problem can be mitigated by applying bagging [8], [9]. Random forests (RF), proposed by Breiman [10] and studied by Biau et al. [11], combine the random subspace method proposed by Ho [12] with bagging. RF have been used for a large variety of tasks, including: identification of DNA-binding proteins [13], segmentation of video objects [14], classification of hyper-spectral data [15], [16], prediction of vegetation type occurrence in Belgian lowland valleys based on spatially distributed measurements of environmental conditions [17], [18], prediction of distributions of bird and mammal species characteristic of the eastern slopes of the central Andes [19], Czech language modeling in the lecture recognition task [20], diagnosing Alzheimer's disease based on single photon emission computed tomography (SPECT) data [21], genetic polymorphism identification [22], prediction of long disordered regions in protein sequences [23], classification of agricultural practices based on Landsat satellite imagery [24], classification of aerial images [25], analysis of phenolic antioxidants in chemistry [26], recognition of handwritten digits [27], categorizing time-depth profiles of diving vertebrates [28], and many others.
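As a minimal illustration (not part of the surveyed studies), a random forest classifier can be applied to a high-dimensional task in a few lines; scikit-learn's RandomForestClassifier and a synthetic data set are used here purely as one common, assumed setup:

```python
# Sketch: a random forest on a synthetic task with many variables,
# using scikit-learn as one widely available RF implementation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: 50 variables, only 10 of them informative.
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy: %.3f" % rf.score(X_te, y_te))
```

The number of trees (200 here) is illustrative; RF accuracy is typically insensitive to this choice once it is large enough.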
Weak learners and random forests
Let us assume that given is a set of training data {(x_m, y_m), m = 1, …, M}, where x_m is an input observation and y_m is a predictor output. A weak learner can be created using the training set. A weak learner is a predictor having a low bias and a high variance [30]. By randomly sampling from the training set, a collection of weak learners h(x, Θ_k), k = 1, …, K, can be created, with h(x, Θ_k) being the kth weak learner and Θ_k the random vector selecting data points for the kth weak learner. By
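The sampling scheme above can be sketched as bootstrap aggregation (bagging) of shallow trees; the names below (theta_k for the bootstrap index vector, K for the ensemble size) are illustrative choices, not notation from the paper:

```python
# Sketch of bagging weak learners: each depth-2 tree h(x, Theta_k) is
# trained on a bootstrap sample selected by the random vector Theta_k.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

K = 25
learners = []
for k in range(K):
    theta_k = rng.integers(0, len(X), size=len(X))  # bootstrap indices
    stump = DecisionTreeClassifier(max_depth=2, random_state=k)
    stump.fit(X[theta_k], y[theta_k])               # h(x, Theta_k)
    learners.append(stump)

# Aggregate the K high-variance weak learners by majority vote.
votes = np.mean([h.predict(X) for h in learners], axis=0)
y_hat = (votes > 0.5).astype(int)
print("ensemble training accuracy:", np.mean(y_hat == y))
```

Averaging over the Θ_k-indexed learners is what reduces the variance of the individual trees while leaving their low bias intact.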
Objectives of the study
Several researchers have found that RF error rates compare favorably to those of other predictors, including logistic regression (LR), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-NN, MLP, SVM, classification and regression trees (CART), naive Bayes [10], [31], [32], [33], and other techniques. However, counter-examples can also easily be found [34]; see Table 2. This is expected since, according to the No Free Lunch theorem [35], there is no single classifier model, which
Large scale studies including random forests
Meyer et al. investigated SVM performance and carried out a large-scale comparison of 16 classification and nine regression techniques [38]. RF, SVM, MLP, and bagged ensembles of trees were among the techniques compared. Hyper-parameters of RF, SVM, and MLP were carefully selected. The classification benchmark was run on 21 data sets, while 12 data sets were used for the regression tests. Most of the data sets come from the UCI machine learning repository. The performance was assessed using 10
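A cross-validated comparison in the spirit of such benchmarks can be sketched as follows; the data set, hyper-parameters, and 10-fold protocol below are illustrative assumptions, not the exact setup of Meyer et al.:

```python
# Sketch: 10-fold cross-validated comparison of RF and SVM on one
# UCI-style data set (breast cancer, shipped with scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
models = {
    "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(C=1.0, gamma="scale")),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print("%s: %.3f +/- %.3f" % (name, scores.mean(), scores.std()))
```

Note that the SVM is wrapped in a scaling pipeline; unlike RF, kernel methods are sensitive to feature scales, which is one practical reason RF needs less tuning in such benchmarks.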
Accuracy related studies
There are a large number of small-scale studies aiming to compare the performance of several techniques on one or very few data sets. A large variety of application areas is explored, and the rigor of the comparisons varies greatly across studies. We summarize the results of these studies in two tables. Table 1 presents a survey of studies where random forests outperformed or performed on the same level as other techniques. Table 2 summarizes studies where random forests were outperformed by other
New tests regarding variable importance measures
We used four public databases for these tests: Waveform, Satimage, Thyroid, and Wisconsin Diagnostic Breast Cancer—WDBC (http://archive.ics.uci.edu/ml/).
There are three classes of waves in the Waveform database [6], with an equal number of instances in each class; each class is generated from a combination of two of three “base” waves. All 40 variables include noise with mean 0 and variance 1, and the last 19 variables are pure noise variables with mean 0 and variance 1. The Satimage data set
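On data of this structure, RF variable importance should rank the pure-noise variables well below the informative ones. The sketch below uses a synthetic stand-in for Waveform (21 informative variables plus 19 appended Gaussian noise variables), so the exact numbers are illustrative only:

```python
# Sketch: RF importance separates informative from pure-noise variables
# on a Waveform-like data set (21 informative + 19 noise variables).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=21, n_informative=21,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
# Append 19 pure-noise variables with mean 0 and variance 1.
X = np.hstack([X, rng.normal(0.0, 1.0, size=(len(X), 19))])

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
imp = rf.feature_importances_
print("mean importance, informative variables:", imp[:21].mean())
print("mean importance, noise variables:      ", imp[21:].mean())
```

scikit-learn's `feature_importances_` is the impurity-based measure; permutation-based importance, also discussed in the RF literature, can be computed with `sklearn.inspection.permutation_importance` and typically gives a similar ranking here.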
Tests concerning problem complexity
The aim of these studies is to get some insights into “suitability” of a problem at hand for RF-based classification. To assess the problem complexity, several measures studied in [36], [37] are used in this work. The measures are listed in Table 4.
The measures F1, F2, F3, and F4 reflect the degree of overlap of individual feature values, while the measures L1 and L2 assess linear separability of classes. To compute the values of the measures L1, L2, and L3, a linear classifier is built. An SVM
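As one concrete example, the overlap measure F1 is commonly defined (following Ho and Basu's complexity measures) as the maximum Fisher discriminant ratio over individual features; the two-class implementation below is a sketch under that assumed definition:

```python
# Sketch of complexity measure F1 (maximum Fisher discriminant ratio):
# per feature i, f_i = (mu1_i - mu2_i)^2 / (var1_i + var2_i); F1 = max_i f_i.
import numpy as np

def fisher_f1(X, y):
    X1, X2 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    return float(np.max(num / den))

# Two well-separated Gaussian classes give a large F1 value.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(3, 1, (100, 5))])
y = np.repeat([0, 1], 100)
print("F1 =", fisher_f1(X, y))
```

A large F1 indicates at least one feature along which the classes barely overlap, i.e. an "easy" problem for most classifiers, including RF.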
Discussion
Fast training, the possibility of obtaining a generalization error estimate without splitting the data set into learning and validation subsets, variable importance evaluations available as a byproduct of training, and only one parameter to be tuned experimentally make RF a very popular data mining technique.
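Two of these conveniences, the split-free error estimate (out-of-bag, OOB) and the byproduct variable importances, can be seen in a few lines; the data set below is an illustrative choice:

```python
# Sketch: OOB accuracy estimate and variable importances obtained as
# byproducts of RF training, with no separate validation subset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)

print("OOB accuracy estimate: %.3f" % rf.oob_score_)
print("indices of the three most important variables:",
      rf.feature_importances_.argsort()[-3:])
```

Each tree's OOB estimate uses the roughly one third of observations left out of that tree's bootstrap sample, which is why no explicit hold-out split is needed.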
Acknowledgements
Useful suggestions from the referees are gratefully acknowledged. We acknowledge the support from the Agency for International Science and Technology Development Programmes in Lithuania (COST Actions IC0602 and IC0806). The infrastructure for parallel and distributed computing and e-services (LitGrid) was used in the studies.
References (112)
- et al., Soft combination of neural classifiers: a comparative study, Pattern Recognition Letters (1999)
- et al., Identification of DNA-binding proteins using structural, electrostatic and evolutionary features, Journal of Molecular Biology (2009)
- et al., Random forests as a tool for ecohydrological distribution modelling, Ecological Modelling (2007)
- et al., Uncertainty propagation in vegetation distribution models based on ensemble classifiers, Ecological Modelling (2009)
- et al., Monitoring of cropland practices for carbon sequestration purposes in north central Montana by Landsat remote sensing, Remote Sensing of Environment (2009)
- et al., QSAR analysis of phenolic antioxidants using MOLMAP descriptors of local properties, Bioorganic & Medicinal Chemistry (2006)
- et al., A validated approach for supervised dive classification in diving vertebrates, Journal of Experimental Marine Biology and Ecology (2008)
- et al., The support vector machine under test, Neurocomputing (2003)
- et al., Customer churn prediction using improved balanced random forests, Expert Systems with Applications (2009)
- et al., Churn prediction in subscription services: an application of support vector machines while comparing two parameter-selection techniques, Expert Systems with Applications (2008)
- Improving customer attrition prediction by integrating emotions from client/company interaction emails and evaluating multiple classifiers, Expert Systems with Applications
- Random forests, a novel approach for discrimination of fish populations using parasites as biological tags, International Journal for Parasitology
- Learning to classify e-mail, Information Sciences
- Towards large-scale FAME-based bacterial species identification using machine learning techniques, Systematic and Applied Microbiology
- Discriminating acidic and alkaline enzymes using a random forest model with secondary structure amino acid composition, Process Biochemistry
- Predicting habitat suitability with machine learning models: the potential area of Pinus sylvestris L. in the Iberian Peninsula, Ecological Modelling
- Probabilistic classifiers and automated cancer registration: an exploratory application, Journal of Biomedical Informatics
- Using single- and multi-target regression trees and ensembles to model a compound index of vegetation condition, Ecological Modelling
- Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting, European Journal of Operational Research
- Rapid and non-destructive identification of strawberry cultivars by direct PTR-MS headspace analysis and data mining techniques, Sensors and Actuators B
- Adaptive wavelet modelling of a nested 3 factor experimental design in NIR chemometrics, Chemometrics and Intelligent Laboratory Systems
- Diagnosing scrapie in sheep: a classification experiment, Computers in Biology and Medicine
- Benchmarking classifiers to optimally integrate terrain analysis and multispectral remote sensing in automatic rock glacier detection, Remote Sensing of Environment
- Predicting customer loyalty using the internal transactional database, Expert Systems with Applications
- A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies, Chemometrics and Intelligent Laboratory Systems
- Empirical characterization of random forest variable importance measures, Computational Statistics & Data Analysis
- Exploring precrash maneuvers using classification trees and random forests, Accident Analysis and Prevention
- Multivariate exponential survival trees and their application to tooth prognosis, Computational Statistics and Data Analysis
- Random-forests-based network intrusion detection systems, IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews
- Bagged super wavelets reduction for boosted prostate cancer classification of SELDI-TOF mass spectral serum profiles, Chemometrics and Intelligent Laboratory Systems
- Predicting customer retention and profitability by using random forests and regression forests techniques, Expert Systems with Applications
- Understanding preferences for income redistribution, Journal of Public Economics
- Hedged predictions for traditional Chinese chronic gastritis diagnosis with confidence machine, Computers in Biology and Medicine
- Random forests for land cover classification, Pattern Recognition Letters
- Statistical Learning Theory
- Kernel Methods for Pattern Analysis
- Pattern Recognition and Machine Learning
- Sparse Bayesian learning and the relevance vector machine, Journal of Machine Learning Research
- Classification and Regression Trees
- Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
- Bagging predictors, Machine Learning
- An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Machine Learning
- Random forests, Machine Learning
- Consistency of random forests and other averaging classifiers, Journal of Machine Learning Research
- The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Segmenting highly articulated video objects with weak-prior random forests
- Investigation of the random forest framework for classification of hyperspectral data, IEEE Transactions on Geoscience and Remote Sensing
- Random forests of binary hierarchical classifiers for analysis of hyperspectral data
- Predicting species distributions in poorly-studied landscapes, Biodiversity and Conservation
- Morphological random forests for language modeling of inflectional languages
Antanas Verikas is currently holding a Professor position at both Halmstad University Sweden and Kaunas University of Technology, Lithuania. His research interests include image processing, pattern recognition, artificial neural networks, fuzzy logic, and visual media technology. He is a member of the International Pattern Recognition Society, European Neural Network Society, International Association of Science and Technology for Development, Swedish Society of Learning Systems, and a member of the IEEE.
Adas Gelzinis received the M.S. degree in Electrical Engineering from Kaunas University of Technology, Lithuania, in 1995, and the Ph.D. degree in Computer Science from the same university in 2000. He is a Senior Researcher in the Department of Electrical and Control Equipment at Kaunas University of Technology. His research interests include artificial neural networks, kernel methods, pattern recognition, signal and image processing, and texture classification.
Marija Bacauskiene is a Senior Researcher in the Department of Electrical and Control Equipment at Kaunas University of Technology, Lithuania. Her research interests include artificial neural networks, image processing, pattern recognition, and fuzzy logic. She participated in various research projects and published numerous papers in these areas.