Big Data Research

Volume 9, September 2017, Pages 28-46

Random Forests for Big Data

https://doi.org/10.1016/j.bdr.2017.07.003

Abstract

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involve massive data, but they also often include online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single and versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how out-of-bag error is addressed in these methods. We then formulate various remarks about random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), a simulated one as well as real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or “divide-and-conquer” approaches. The fifth variant is related to online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.

Introduction

Big Data is one of the major challenges of statistical science, and many recent references have begun to consider the numerous consequences of this new context, both from the algorithmic viewpoint and for the theoretical implications of this new framework [1], [2], [3]. Big Data always involve massive data: for instance, Thusoo et al. [4] indicate that Facebook had more than 21 PB of data in 2010. They also often include data streams and data heterogeneity [5]. From a practical point of view, they are characterized by the fact that the data are frequently not structured and properly indexed in a database. Thus, simple queries cannot be easily performed on such data. These features lead to the famous three Vs (Volume, Velocity and Variety) highlighted by Gartner, Inc., the information technology research and advisory company, and now often augmented with other Vs [6]. In the most extreme situations, the data can even be too large to fit in a single computer's memory. The data are then distributed among several computers, and their distribution is managed using specific frameworks dedicated to shared storage computing environments, such as Hadoop.

For statistical science, the problem posed by this large amount of data is twofold. First, since many statistical procedures have devoted little attention to computational runtimes, they can take too long to provide results in an acceptable time. When dealing with complex tasks, such as learning a prediction model or performing a complex exploratory analysis, this issue can occur even if the dataset would be considered of moderate size for other, simpler tasks. Also, as pointed out in [7], the notion of Big Data itself depends on the available computing resources. This is especially true when relying on the free statistical software R [8], massively used in the statistical community, whose capabilities are strictly limited by RAM. In this case, data can be considered “large” if their size exceeds 20% of RAM and “massive” if it exceeds 50% of RAM, because such an amount of data strongly limits the memory available for learning the statistical model itself. For memory-demanding statistical methods and implementations, the RAM can even be overloaded by datasets occupying only a moderate fraction of it. As pointed out in [3], in the near future, statistics will have to deal with problems of scale and computational complexity to remain relevant. In particular, collaboration between statisticians and computer scientists is needed to control runtimes so that statistical procedures remain usable on large-scale data while ensuring good statistical properties.
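
As an illustration of this rule of thumb, a minimal R sketch is given below; the objects my_data and ram_bytes (the machine's total RAM in bytes, obtainable for instance via the benchmarkme package) are hypothetical names introduced here only for illustration.

    ## Hedged sketch: classify a dataset as "large" or "massive" relative to RAM,
    ## following the 20%/50% rule of thumb mentioned above.
    ## 'my_data' and 'ram_bytes' are assumed to exist (hypothetical names).
    data_bytes <- as.numeric(object.size(my_data))   # in-memory size of the dataset
    ratio <- data_bytes / ram_bytes
    size_class <- if (ratio > 0.5) "massive" else if (ratio > 0.2) "large" else "moderate"
    message(sprintf("Dataset uses %.1f%% of RAM: %s", 100 * ratio, size_class))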

Recently, some statistical methods have been adapted to process Big Data, including linear regression models, clustering methods and bootstrapping schemes [9], [10]. The main proposed strategies are based on (i) subsampling, (ii) a divide-and-conquer approach, (iii) algorithm weakening and (iv) online processing.

Subsampling is probably the simplest way to handle large datasets. It has proved efficient for approximating the spectral analysis of large matrices using an approximate decomposition, such as the Nyström algorithm [11]. It is also a valuable strategy to produce an approximate bootstrap scheme [12]. Simple random sampling often produces a representative enough subsample, but such a subsample can be hard to obtain if the data are distributed over different computers and the subsample itself has to be built in parallel: online subsampling strategies allowing stratified sampling are presented in [13] and can overcome this problem. Improved subsampling strategies can also be designed, such as the core-set strategy used for clustering problems in [14], which extracts a small but relevant set of points to perform approximate clustering efficiently. Finally, an alternative that alleviates the impact of subsampling, without the need for sophisticated subsampling schemes, is to perform several subsamplings and to combine the different results [15].
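
As an illustration, the simplest strategy (an m-out-of-n simple random subsample on which a standard forest is grown) can be sketched in a few lines of R; my_data (a data frame with a factor response y) and the subsample size are assumptions introduced only for illustration.

    ## Hedged sketch of subsampling followed by a standard random forest.
    ## 'my_data' is a hypothetical data.frame with a factor response 'y'.
    library(randomForest)
    set.seed(1)
    m <- 1e5                                     # assumed subsample size
    sub <- my_data[sample(nrow(my_data), m), ]   # simple random subsample (without replacement)
    rf_sub <- randomForest(y ~ ., data = sub, ntree = 100)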

The divide-and-conquer approach consists in splitting the problem into several smaller problems and gathering the different results in a final step. This is the approach followed in the popular MapReduce programming paradigm [16]. Most of the time, the combination is a simple aggregation or averaging of the different results, but this simple method might lead to biased estimations in some statistical models, even ones as simple as a linear model. Solutions include re-weighting the different results [17].
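
A minimal R sketch of this idea for forests (one forest per chunk in the "map" step, aggregation of the forests in the "reduce" step, here via randomForest::combine) is given below; the number of chunks and the dataset my_data are assumptions used only for illustration.

    ## Hedged sketch of a divide-and-conquer forest: "map" grows one forest per
    ## chunk of the data, "reduce" merges the forests into a single one.
    library(randomForest)
    K <- 10                                                   # assumed number of chunks
    chunk_id <- sample(rep(seq_len(K), length.out = nrow(my_data)))
    forests <- lapply(split(my_data, chunk_id),
                      function(d) randomForest(y ~ ., data = d, ntree = 50))
    rf_dac <- do.call(randomForest::combine, forests)         # 10 x 50 = 500 trees in total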

Algorithm weakening is a very different approach, designed for methods based on convex optimization problems [18]. This method explicitly treats the trade-off between computational time and statistical accuracy using a hierarchy of relaxed optimization problems with increasing complexity.

Finally, online approaches update the results through sequential steps, each having a low computational cost. They very often require a specific rewriting of the method in order to single out the contribution of a given observation to the result. In this case, the online update is strictly equivalent to processing the whole dataset, but with a reduced computational time [19]. However, in most cases, such an equivalence cannot be obtained and a modification of the original method is needed to allow online updates [20].
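
To make the distinction concrete, the toy R sketch below shows an exact online update (the running mean), for which processing the observations one at a time is strictly equivalent to processing the whole dataset at once; this is only an illustration of the principle, not of online RF, and the vector stream is a hypothetical name.

    ## Toy illustration of an exact online update: the running mean.
    ## 'stream' is a hypothetical numeric vector whose values arrive sequentially.
    running_mean <- 0
    n_seen <- 0
    for (x_new in stream) {
      n_seen <- n_seen + 1
      running_mean <- running_mean + (x_new - running_mean) / n_seen  # O(1) update per observation
    }
    ## running_mean now equals mean(stream), up to floating point error.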

It has to be noted that only a few papers really address the question of the statistical accuracy of approximate versions of the original approach, used to deal with the large sample size, compared to the standard “small data” framework. Noticeable exceptions are the article of Kleiner et al. [12], who prove that their “Bag of Little Bootstraps” method is statistically equivalent to the standard bootstrap; the article of Chen and Xie [17], who demonstrate the asymptotic equivalence of their “divide-and-conquer” based estimator with the estimator based on all the data in the setting of linear regression; and the article of Yan et al. [11], who show that the mis-clustering rate of their subsampling approach, compared to what would have been obtained with a direct approach on the whole dataset, converges to zero as the subsample size grows (in an unsupervised setting).
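
For concreteness, the resampling scheme underlying the “Bag of Little Bootstraps” of [12] can be sketched in a few lines of R, here to assess the standard error of a simple mean; the vector x and the values of s and r are assumptions used only for illustration, and b = n^0.6 is one typical subset-size choice.

    ## Hedged sketch of the Bag of Little Bootstraps resampling scheme:
    ## s small subsamples of size b, each resampled r times with weights of nominal size n.
    n <- length(x)
    b <- floor(n^0.6); s <- 20; r <- 50                    # assumed tuning values
    se_by_subset <- replicate(s, {
      subset_x <- x[sample(n, b)]                          # one small subsample
      est <- replicate(r, {
        w <- as.vector(rmultinom(1, size = n, prob = rep(1 / b, b)))
        sum(w * subset_x) / n                              # mean of a weighted resample of nominal size n
      })
      sd(est)                                              # quality assessment on this subset
    })
    blb_se <- mean(se_by_subset)                           # averaged over the s subsets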

Based on decision trees and combined with aggregation and bootstrap ideas, random forests (abbreviated RF in the sequel), were introduced by Breiman [21]. They are a powerful nonparametric statistical method allowing to consider regression problems as well as two-class and multi-class classification problems, in a single and versatile framework. The consistency of RF has recently been proved by Scornet et al. [22], to cite the most recent result. On a practical point of view, RF are widely used [23], [24] and exhibit extremely high performance with only a few parameters to tune. Since RF are based on the definition of several independent trees, it is thus straightforward to obtain a parallel and faster implementation of the RF method, in which many trees are built in parallel on different cores. However, direct parallel training of the trees might be intractable in practice, due to the large size of the bootstrap samples. As RF also include intensive resampling, it is natural to consider adapted bootstrapping schemes for the massive online context, in addition to parallel processing.
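
Since the trees are built independently, such a parallel implementation can be sketched in a few lines of R, as below; the number of cores and the dataset my_data are assumptions, and the sketch supposes that the data fit in memory on each core.

    ## Hedged sketch of parallel RF training: each core grows a sub-forest on the
    ## same data and the sub-forests are then merged.
    library(randomForest)
    library(parallel)
    n_cores <- 4                                        # assumed number of cores
    sub_forests <- mclapply(seq_len(n_cores), function(i) {
      randomForest(y ~ ., data = my_data, ntree = 125)  # 4 x 125 = 500 trees in total
    }, mc.cores = n_cores)
    rf_parallel <- do.call(randomForest::combine, sub_forests)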

Even if the method has already been adapted and implemented to handle Big Data in various distributed environments (see, for instance, the Mahout library or MLlib, the latter for the distributed framework Spark, among others), many questions remain open. In this paper, we do not seek to give an exhaustive description of the various implementations of RF in scalable environments, but we highlight some of the problems that the Big Data framework poses to RF, describe several standard strategies that can be used, and discuss their main features, drawbacks and differences with the original approach. We finally experiment with five variants on two massive datasets (15 and 120 million observations), a simulated one as well as real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or “divide-and-conquer” approaches. The fifth variant relates to online learning of RF. To the best of our knowledge, no weakening strategy has been developed for RF.

Since the free statistical software R [8] is the de facto Esperanto of the statistical community, and since the most flexible and widely used programs for designing random forests are also available in R, we have adopted it for the numerical experiments as much as possible. More precisely, the R package randomForest, which implements the original RF algorithm using Breiman and Cutler's Fortran code, offers many options together with detailed documentation, and it has been used in almost all experiments. The only exception is online RF, for which no R implementation is available; a Python library was used instead, as an alternative tool, in order to compare online learning with the other Big Data variants.
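
For reference, a minimal call to the randomForest package looks as follows; the built-in iris data merely stands in for the actual datasets of the experiments.

    ## Minimal usage sketch of the randomForest package on a small built-in dataset.
    library(randomForest)
    set.seed(1)
    rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
    print(rf)                            # the printed summary includes the out-of-bag (OOB) error
    pred <- predict(rf, newdata = iris)  # predictions (resubstitution, for illustration only)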

The paper is organized as follows. After this introduction, we briefly recall some basic facts about RF in Section 2. Then, Section 3 is focused on strategies for scaling random forests to Big Data: some proposals about RF in parallel environments are reviewed, as well as a description of online strategies. The section includes a comparison of the features of every method and a discussion about the estimation of the out-of-bag error. Section 4 is devoted to numerical experiments on two massive datasets, an extensive study on a simulated one and an application to real world data. Finally, Section 5 collects some conclusions and discusses two open perspectives.

Random forests

Denoting by $\mathcal{L} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ a learning set of independent observations of the random vector $(X, Y)$, we distinguish $X = (X^1, \ldots, X^p)$, where $X \in \mathbb{R}^p$ is the vector of the predictors (or explanatory variables), from $Y \in \mathcal{Y}$, the explained variable, where $\mathcal{Y}$ is either a class label for classification problems or a numerical response for regression ones. A classifier $s$ is a mapping $s: \mathbb{R}^p \to \mathcal{Y}$, while the regression function appears naturally to be the function $s$ when we suppose that $Y = s(X) + \varepsilon$ with $\mathbb{E}[\varepsilon \mid X] = 0$. RF

Scaling random forests to Big Data

This section discusses the different strategies that can be used to scale RF to Big Data. These strategies differ from the original method, seqRF, at two different levels. The first difference lies in the implementation, which can be either sequential, using only one computational process (as in the original method), or parallel. The direct parallel implementation of RF is denoted by parRF, but it is very limited if the sample size is large because it requires handling in parallel several

Experiments

The present section is devoted to numerical experiments on a massive simulated dataset (15 million observations) as well as on a real-world dataset (120 million observations). These experiments aim at illustrating and comparing the five variants of RF for Big Data introduced in Section 3. The experimental framework and the data simulation model are first presented, and the baseline used for the comparison, seqRF, is described. Then, the four variants involving parallel implementations

Conclusion and discussion

This final section provides a short conclusion and opens two perspectives. The first one proposes to consider re-weighting RF as an alternative for tackling the lack of representativeness for BD-RF and the second one focuses on alternative online RF schemes and on RF for data streams.

Additional file 1 — R and Python scripts used for the simulation

R scripts used in the simulation sections are available at https://github.com/tuxette/bigdatarf.

Conflict of interest statement

The authors declare that they have no competing interests.

Acknowledgements

The authors thank the editor and the two anonymous referees for their thorough comments and suggestions which really helped to deeply improve the paper. The authors are also grateful to the MIAT IT team and especially to Damien Berry, who provided a fast and efficient support for system and software configuration.

References (50)

  • S. Yin et al., Big data for modern industry: challenges and trends.
  • M. Kane et al., Scalable strategies for computing with massive data, J. Stat. Softw. (2013).
  • R: A Language and Environment for Statistical Computing (2016).
  • P. Besse et al., Statistique et big data analytics. Volumétrie, l'attaque des clones.
  • C. Wang et al., A survey of statistical methods and computing for big data.
  • D. Yan et al., Fast approximate spectral clustering.
  • A. Kleiner et al., A scalable bootstrap for massive data, J. R. Stat. Soc., Ser. B, Stat. Methodol. (2014).
  • X. Meng, Scalable simple random sampling and stratified sampling.
  • M. Bǎdoiu et al., Approximate clustering via core-sets.
  • N. Laptev et al., Early accurate results for advanced analytics on MapReduce, Proc. VLDB Endow. (2012).
  • C. Chu et al., Map-Reduce for machine learning on multicore.
  • X. Chen et al., A split-and-conquer approach for analysis of extraordinarily large data, Stat. Sin. (2014).
  • V. Chandrasekaran et al., Computational and statistical tradeoffs via convex relaxation, Proc. Natl. Acad. Sci. USA (2013).
  • P. Laskov et al., Incremental support vector learning: analysis, implementation and application, J. Mach. Learn. Res. (2006).
  • A. Saffari et al., On-line random forests.
