Random Forests for Big Data
Introduction
Big Data is one of the major challenges of statistical science, and many recent references examine the numerous consequences of this new context, both from the algorithmic viewpoint and for its theoretical implications [1], [2], [3]. Big Data always involves massive amounts of data: for instance, Thusoo et al. [4] indicate that Facebook© had more than 21 PB of data in 2010. It also often involves data streams and data heterogeneity [5]. From a practical point of view, such data are frequently unstructured and not properly indexed in a database, so that simple queries cannot easily be performed on them. These features lead to the famous three Vs (Volume, Velocity and Variety) highlighted by Gartner, Inc., the information technology research and advisory company,1 now often augmented with other Vs [6]. In the most extreme situations, the data can even be too large to fit in a single computer's memory and must be distributed among several computers. In this case, the distribution of the data is managed using frameworks dedicated to shared storage computing environments, such as Hadoop.2
For statistical science, the problem posed by this large amount of data is twofold. First, since many statistical procedures have paid little attention to computational runtimes, they can take too long to provide results. When dealing with complex tasks, such as learning a prediction model or performing a complex exploratory analysis, this issue can occur even if the dataset would be considered of moderate size for other, simpler tasks. Also, as pointed out in [7], the notion of Big Data itself depends on the available computing resources. This is especially true when relying on the free statistical software R [8], massively used in the statistical community, whose capabilities are strictly limited by RAM. In this case, data can be considered “large” if their size exceeds 20% of RAM and “massive” if it exceeds 50% of RAM, because such an amount of data strongly limits the memory available for learning the statistical model itself. For memory-demanding statistical methods and implementations, the RAM can be overloaded even by datasets occupying a rather moderate share of it. As pointed out in [3], in the near future, statistics will have to deal with problems of scale and computational complexity to remain relevant. In particular, collaboration between statisticians and computer scientists is needed to control runtimes so as to keep statistical procedures usable on large-scale data while ensuring good statistical properties.
Recently, some statistical methods have been adapted to process Big Data, including linear regression models, clustering methods and bootstrapping schemes [9], [10]. The main proposed strategies are based on i) subsampling, ii) divide-and-conquer approaches, iii) algorithm weakening and iv) online processing.
Subsampling is probably the simplest way to handle large datasets. It has proved efficient for approximating the spectral analysis of large matrices through approximate decompositions such as the Nyström algorithm [11]. It is also a valuable strategy for producing approximate bootstrap schemes [12]. Simple random sampling often produces a sufficiently representative subsample, but it can be hard to carry out when data are distributed over different computers and the subsample itself has to be built in parallel: online subsampling strategies allowing stratified sampling are presented in [13] and can overcome this problem. Improved subsampling strategies can also be designed, such as the core-set strategy used for clustering problems in [14], which extracts a small but relevant set of points on which approximate clustering can be performed efficiently. Finally, an alternative that alleviates the impact of subsampling without resorting to sophisticated subsampling schemes is to perform several subsamplings and to combine the different results [15].
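As an illustration of online subsampling, reservoir sampling draws a uniform subsample in a single pass over a data stream, without requiring the data to be indexed or to fit in memory. The sketch below is a minimal illustration of this idea, not a method from the references above; the function name is hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Draw a uniform random sample of k items in one pass over a stream.

    After i items have been seen, each new item replaces a reservoir slot
    with probability k/i, so the final reservoir is a uniform sample of
    the whole stream, whatever its (unknown) length.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = rng.randrange(i)  # uniform in {0, ..., i-1}
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10**6), k=100)
print(len(sample))  # 100
```

The same one-pass logic extends to stratified variants by keeping one reservoir per stratum.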
The divide-and-conquer approach consists of splitting the problem into several smaller problems and gathering the different results in a final step. This is the approach followed by the popular MapReduce programming paradigm [16]. Most of the time, the combination is based on a simple aggregation or averaging of the different results, but this simple method can lead to biased estimations in some statistical models, even ones as simple as a linear model. Solutions include re-weighting the different results [17].
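A toy illustration of why re-weighting matters: for a no-intercept linear model fitted chunk by chunk, a plain average of the chunk slopes is in general not the full-data estimator, while weighting each chunk by its information content recovers it exactly. The sketch below is a minimal illustration of this phenomenon, not the estimator of [17]; all names and the small dataset are hypothetical.

```python
def chunk_slopes(xs, ys, n_chunks):
    """Fit the slope of y = b*x (no intercept) separately on each chunk.

    Returns (slopes, weights), where weight_k = sum of x^2 over chunk k,
    i.e. the amount of information chunk k carries about the slope.
    """
    size = len(xs) // n_chunks
    slopes, weights = [], []
    for k in range(n_chunks):
        cx = xs[k * size:(k + 1) * size]
        cy = ys[k * size:(k + 1) * size]
        sxx = sum(x * x for x in cx)
        sxy = sum(x * y for x, y in zip(cx, cy))
        slopes.append(sxy / sxx)
        weights.append(sxx)
    return slopes, weights

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
slopes, weights = chunk_slopes(xs, ys, n_chunks=3)

naive = sum(slopes) / len(slopes)  # plain averaging of chunk estimates
weighted = sum(s * w for s, w in zip(slopes, weights)) / sum(weights)
full = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
# `weighted` recovers the full-data estimator exactly; `naive` does not.
```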
Algorithm weakening is a very different approach, designed for methods based on convex optimization problems [18]. It explicitly handles the trade-off between computational time and statistical accuracy through a hierarchy of relaxed optimization problems of increasing complexity.
Finally, online approaches update the results through sequential steps, each having a low computational cost. This very often requires a specific rewriting of the method that singles out the contribution of a given observation. In such cases, the online update is strictly equivalent to processing the whole dataset, but with a reduced computational time [19]. In most cases, however, such an equivalence cannot be obtained, and a modification of the original method is needed to allow online updates [20].
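A minimal example of an online update that is strictly equivalent to processing the whole dataset is the running mean: each observation contributes an O(1) correction, and after n updates the result equals the batch mean over the n observations. The class name below is hypothetical.

```python
class RunningMean:
    """Online mean: each update costs O(1), and after n updates the
    value equals the batch mean of the n observations seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental correction
        return self.mean

rm = RunningMean()
for x in [1.0, 4.0, 7.0]:
    rm.update(x)
print(rm.mean)  # 4.0
```

For most statistical methods, including RF, no such exact decomposition of the contribution of one observation exists, which is why approximate online variants are needed.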
It has to be noted that only a few papers really address the difference, in terms of statistical accuracy, between the standard “small data” framework and the Big Data framework when approximate versions of the original approach are used to handle the large sample size. Noticeable exceptions are the article of Kleiner et al. [12], who prove that their “Bag of Little Bootstraps” method is statistically equivalent to the standard bootstrap; the article of Chen and Xie [17], who demonstrate the asymptotic equivalence of their divide-and-conquer based estimator with the estimator based on all the data in the linear regression setting; and the article of Yan et al. [11], who show that the mis-clustering rate of their subsampling approach, compared to what would have been obtained with a direct approach on the whole dataset, converges to zero as the subsample size grows (in an unsupervised setting).
Based on decision trees combined with aggregation and bootstrap ideas, random forests (abbreviated RF in the sequel) were introduced by Breiman [21]. They are a powerful nonparametric statistical method that handles regression problems as well as two-class and multi-class classification problems in a single, versatile framework. The consistency of RF has recently been proved by Scornet et al. [22], to cite the most recent result. From a practical point of view, RF are widely used [23], [24] and exhibit extremely high performance with only a few parameters to tune. Since RF are based on several independently built trees, it is straightforward to obtain a parallel and faster implementation of the method, in which many trees are built in parallel on different cores. However, direct parallel training of the trees might be intractable in practice, due to the large size of the bootstrap samples. As RF also rely on intensive resampling, it is natural to consider, in addition to parallel processing, bootstrapping schemes adapted to the massive online context.
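One bootstrap scheme adapted to the massive or online context is the Poisson bootstrap: instead of drawing a Multinomial(n, 1/n, …, 1/n) vector of counts over a fully indexed dataset, each observation's multiplicity is drawn independently as Poisson(1), which can be done on a stream or on data distributed across machines. The sketch below is a minimal stdlib-only illustration of this scheme (mentioned among the references); the function names are hypothetical.

```python
import math
import random

EXP_MINUS_1 = math.exp(-1.0)

def poisson1(rng):
    """Sample from Poisson(1) using Knuth's multiplication method."""
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= EXP_MINUS_1:
            return k
        k += 1

def poisson_bootstrap_counts(n, seed=0):
    """Per-observation bootstrap multiplicities, drawn independently.

    Since each count only depends on its own observation, the counts can
    be generated in one pass over a stream or locally on each machine,
    without knowing n in advance or coordinating a global resampling.
    """
    rng = random.Random(seed)
    return [poisson1(rng) for _ in range(n)]

counts = poisson_bootstrap_counts(10**5)
print(sum(counts) / len(counts))  # close to 1.0, as for the classical bootstrap
```

Each tree of a Big Data RF variant can then be grown on the weighted sample defined by its own vector of counts.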
Even if the method has already been adapted and implemented to handle Big Data in various distributed environments (see, for instance, the libraries Mahout3 or MLlib, the latter for the distributed framework Spark,4 among others), many questions remain open. In this paper, we do not seek to give an exhaustive description of the various implementations of RF in scalable environments; rather, we highlight some of the problems posed to RF by the Big Data framework, describe several standard strategies that can be used, and discuss their main features, drawbacks and differences with the original approach. We finally experiment with five variants on two massive datasets (15 and 120 million observations), a simulated one as well as real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or divide-and-conquer approaches. The fifth variant relates to online learning of RF. To the best of our knowledge, no weakening strategy has been developed for RF.
Since the free statistical software R [8] is the de facto lingua franca of the statistical community, and since the most flexible and widely used programs for designing random forests are also available in R, we have adopted it for the numerical experiments as much as possible. More precisely, the R package randomForest, which implements the original RF algorithm using Breiman and Cutler's Fortran code, offers many options together with a detailed documentation. It has therefore been used in almost all experiments. The only exception is online RF, for which no implementation in R is available: a Python library was used instead, in order to compare online learning with the other Big Data variants.
The paper is organized as follows. After this introduction, we briefly recall some basic facts about RF in Section 2. Section 3 then focuses on strategies for scaling random forests to Big Data: proposals for RF in parallel environments are reviewed, together with a description of online strategies. The section includes a comparison of the features of each method and a discussion of the estimation of the out-of-bag error. Section 4 is devoted to numerical experiments on two massive datasets: an extensive study on a simulated one and an application to real-world data. Finally, Section 5 collects some conclusions and discusses two open perspectives.
Random forests
Denoting by $\mathcal{L}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ a learning set of independent observations of the random vector $(X, Y)$, we distinguish $X = (X^1, \ldots, X^p) \in \mathbb{R}^p$, the vector of the predictors (or explanatory variables), from the explained variable $Y \in \mathcal{Y}$, where $Y$ is either a class label for classification problems or a numerical response for regression ones. A classifier $s$ is a mapping $s: \mathbb{R}^p \rightarrow \mathcal{Y}$, while the regression function appears naturally to be the function $s$ when we suppose that $Y = s(X) + \varepsilon$ with $\mathbb{E}[\varepsilon \mid X] = 0$.
Scaling random forests to Big Data
This section discusses the different strategies that can be used to scale RF to Big Data. These strategies differ from the original method, seqRF, at two different levels. The first difference lies in the implementation, which can be either sequential, using only one computational process (as in the original method), or parallel. The direct parallel implementation of RF is denoted by parRF but is very limited when the sample size is large, because it requires handling in parallel several
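The parRF idea (trees trained independently in parallel on bootstrap samples, predictions aggregated by averaging) can be sketched as follows. This is a deliberately toy illustration, not the paper's implementation: the "tree" is reduced to a one-split stump and all names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def train_stump(data, seed):
    """Train one toy 'tree' (a single-split stump) on a bootstrap sample."""
    rng = random.Random(seed)
    boot = [data[rng.randrange(len(data))] for _ in range(len(data))]
    split = mean(x for x, _ in boot)
    left = [y for x, y in boot if x <= split]
    right = [y for x, y in boot if x > split]
    return split, mean(left) if left else 0.0, mean(right) if right else 0.0

def par_rf(data, n_trees=25, n_workers=4):
    """parRF idea: trees are independent, so training parallelizes
    trivially across workers; prediction averages over the trees."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        trees = list(pool.map(lambda s: train_stump(data, s), range(n_trees)))

    def predict(x):
        return mean(l if x <= s else r for s, l, r in trees)

    return predict

# Toy usage: y = 2x on 100 points; predictions for small x stay below
# predictions for large x.
data = [(float(i), 2.0 * float(i)) for i in range(100)]
predict = par_rf(data)
```

The practical limitation discussed above appears here too: each worker needs access to its own bootstrap sample of the full data, which is what becomes intractable when the sample size is very large.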
Experiments
The present section is devoted to numerical experiments on a massive simulated dataset (15 million observations) as well as on a real-world dataset (120 million observations). These experiments aim to illustrate and compare the five variants of RF for Big Data introduced in Section 3. The experimental framework and the data simulation model are first presented, and the baseline used for the comparison, seqRF, is described. Then, the four variants involving parallel implementations
Conclusion and discussion
This final section provides a short conclusion and opens two perspectives. The first one proposes to consider re-weighting RF as an alternative for tackling the lack of representativeness for BD-RF and the second one focuses on alternative online RF schemes and on RF for data streams.
Additional file 1 — R and python scripts used for the simulation
R scripts used in the simulation sections are available at https://github.com/tuxette/bigdatarf.
Conflict of interest statement
The authors declare that they have no competing interests.
Acknowledgements
The authors thank the editor and the two anonymous referees for their thorough comments and suggestions, which really helped to improve the paper. The authors are also grateful to the MIAT IT team, and especially to Damien Berry, who provided fast and efficient support for system and software configuration.
References (50)

- Mining data with random forests: a survey and results of new tests, Pattern Recognit. (2011)
- Variable selection using random forests, Pattern Recognit. Lett. (2010)
- On the use of MapReduce for imbalanced big data using random forest, Inf. Sci. (2014)
- Creating non-parametric bootstrap samples using Poisson frequencies, Comput. Methods Programs Biomed. (2006)
- A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. (1997)
- Challenges of big data analysis, Nat. Sci. Rev. (2014)
- Applying statistical thinking to ‘Big Data’ problems, Wiley Interdiscip. Rev.: Comput. Stat. (2014)
- On statistics, computation and scalability, Bernoulli (2013)
- Data warehousing and analytics infrastructure at Facebook
- Big data – Retour vers le futur 3. De statisticien à data scientist
- Big data for modern industry: challenges and trends
- Scalable strategies for computing with massive data, J. Stat. Softw.
- R: A Language and Environment for Statistical Computing
- Statistique et big data analytics. Volumétrie, l'attaque des clones
- A survey of statistical methods and computing for big data
- Fast approximate spectral clustering
- A scalable bootstrap for massive data, J. R. Stat. Soc., Ser. B, Stat. Methodol.
- Scalable simple random sampling and stratified sampling
- Approximate clustering via core-sets
- Early accurate results for advanced analytics on MapReduce, Proceedings of the 28th International Conference on Very Large Data Bases, Proc. VLDB Endow.
- Map-Reduce for machine learning on multicore
- A split-and-conquer approach for analysis of extraordinarily large data, Stat. Sin.
- Computational and statistical tradeoffs via convex relaxation, Proc. Natl. Acad. Sci. USA
- Incremental support vector learning: analysis, implementation and application, J. Mach. Learn. Res.
- On-line random forests