Regression tree-based active learning

Published in: Data Mining and Knowledge Discovery

Abstract

Machine learning algorithms often require large training sets to perform well, but labeling such large amounts of data is not always feasible: in many applications, substantial human effort and material cost are needed. Finding effective ways to reduce the size of training sets while maintaining the same performance is then crucial: one wants to choose the best sample of fixed size to be labeled among a given population, aiming at an accurate prediction of the response. This challenge has been studied in detail in classification, but not deeply enough in regression, which is known to be a more difficult task for active learning despite its practical importance. The few model-free active learning methods that select the new samples to be labeled using unlabeled data alone lack information about the conditional distribution of the response given the features. In this paper, we propose a standard regression tree-based active learning method for regression that improves significantly upon existing active learning approaches. It provides impressive results for small and large training sets and an appreciably low variance over several runs. We also exploit model-free approaches and adapt them to our algorithm to utilize maximum information. Through experiments on numerous benchmark datasets, we demonstrate that our framework improves on existing methods and is effective in learning a regression model from a very limited labeled dataset, reducing the sample size needed for a fixed level of performance, even with many features.


Notes

  1. These results were given for purely random Mondrian trees. Here, we use a regression tree that contains the knowledge of the training set, thus improving the structure of the tree.

  2. https://archive.ics.uci.edu/.

  3. For a deeper understanding, we illustrate our method on a generated multimodal dataset in Appendix 8 as well.

  4. https://archive.ics.uci.edu/ml/datasets/superconductivty+data/.

References

  • Burbidge R, Rowland JJ, King RD (2007) Active learning for regression based on query by committee. In: Yin H, Tino P, Corchado E, Byrne W, Yao X (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2007. Springer, Berlin, pp 209–218


  • Cai W, Zhang M, Zhang Y (2017) Batch mode active learning for regression with expected model change. IEEE Trans Neural Networks Learn Syst 28(7):1668–1681


  • Cai W, Zhang Y, Zhou J (2013) Maximizing expected model change for active learning in regression. In: 2013 IEEE 13th international conference on data mining, pp 51–60

  • Chan NN (1982) A-optimality for regression designs. J Math Anal Appl 87(1):45–50


  • Chaudhuri K, Jain P, Natarajan N (2017) Active heteroscedastic regression. In: Proceedings of the 34th international conference on machine learning. Proceedings of machine learning research, vol 70, pp 694–702

  • Chauvet G, Tillé Y (2006) A fast algorithm for balanced sampling. Comput Stat 21(1):53–62


  • Cohn D, Ghahramani Z, Jordan M (1994) Active learning with statistical models. In: Advances in neural information processing systems, vol 7

  • Goetz J, Tewari A, Zimmerman P (2018) Active learning for non-parametric regression using purely random trees. In: Advances in neural information processing systems, vol 31

  • Hazan E, Karnin Z (2014) Hard-margin active linear regression. In: Xing EP, Jebara T (eds) Proceedings of the 31st international conference on machine learning. Proceedings of Machine Learning Research, vol 32. PMLR, Bejing, pp 883–891

  • Holzmüller D, Zaverkin V, Kästner J, Steinwart I (2023) A framework and benchmark for deep batch active learning for regression

  • John RCS, Draper NR (1975) D-optimality for regression designs: a review. Technometrics 17(1):15–23


  • Kaur H, Kaur H, Sharma A (2021) A review of recent advancement in superconductors. Mater Today Proc 37:3612–3614


  • Lakshminarayanan B, Roy DM, Teh YW (2014) Mondrian forests: Efficient online random forests. Adv Neural Inf Process Syst 27:1


  • Liu Z, Jiang X, Luo H, Fang W, Liu J, Wu D (2021) Pool-based unsupervised active learning for regression using iterative representativeness-diversity maximization. Pattern Recognit Lett 142:11–19


  • Luo Z, Hauskrecht M (2019) Region-based active learning with hierarchical and adaptive region construction, pp 441–449

  • O’Neill J, Jane Delany S, MacNamee B (2017) Model-free and model-based active learning for regression. Advances in computational intelligence systems. Springer, Cham, pp 375–386


  • Polyzos KD, Lu Q, Giannakis GB (2022) Weighted ensembles for active learning with adaptivity

  • Pukelsheim F (2006) Optimal design of experiments (classics in applied mathematics). Society for Industrial and Applied Mathematics, USA


  • Riis C, Antunes F, Hüttel FB, Azevedo CL, Pereira FC (2023) Bayesian active learning with fully Bayesian gaussian processes

  • Sabato S, Munos R (2014) Active regression by stratification. In: Proceedings of the 27th international conference on neural information processing systems—Volume 1. NIPS’14. MIT Press, Cambridge, pp 469–477

  • Willett R, Nowak R, Castro R (2005) Faster rates in regression via active learning. In: Advances in neural information processing systems, vol 18

  • Woods DC, Lewis SM, Eccleston JA, Russell KG (2006) Designs for generalized linear models with several variables and model uncertainty. Technometrics 48(2):284–292


  • Wu D (2019) Pool-based sequential active learning for regression. IEEE Trans Neural Networks Learn Syst 30(5):1348–1359


  • Wu D, Lin C-T, Huang J (2019) Active learning for regression using greedy sampling. Inf Sci 474:90–105


  • Xue Y, Hauskrecht M (2019) Active learning of multi-class classification models from ordered class sets. Proc AAAI Conf Artif Intell 33(01):5589–5596


  • Xue Y, Hauskrecht M (2017) Active learning of classification models with Likert-scale feedback. In: Proceedings of the SIAM international conference on data mining, pp 28–35

  • Yang M, Biedermann S, Tang E (2013) On optimal designs for nonlinear models: a general and efficient algorithm. J Am Stat Assoc 108(504):1411–1420


  • Yu H, Kim S (2010) Passive sampling for regression. In: 2010 IEEE international conference on data mining, pp 1151–1156

  • Zhang H, Ravi SS, Davidson I (2020) A graph-based approach for active learning in regression, pp 280–288

  • Zhao J, Sun S, Wang H, Cao Z (2020) Promoting active learning with mixtures of gaussian processes. Knowl-Based Syst 188:105044



Funding

This work has been partially supported by MIAI@Grenoble Alpes (ANR-19-P3IA-0003).

Author information


Corresponding author

Correspondence to Ashna Jose.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

The code is available at https://github.com/AshnaJose/Regression-Tree-based-Active-Learning along with the datasets.

Additional information

Responsible editors: Tania Cerquitelli and Charalampos Tsourakakis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: RT-AL

Figure 7 gives an overview of the RT-AL algorithm.

Fig. 7: Flowchart giving an overview of regression tree-based active learning (RT-AL)
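The core loop of the flowchart can be sketched as follows. This is an illustrative reconstruction, not the paper's exact procedure: the function name `rt_al_query`, the proportional budget allocation, and the random within-leaf picks are our assumptions (the paper also studies diversity- and representativity-based within-leaf criteria).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def rt_al_query(X_labeled, y_labeled, X_pool, batch_size, rng=None):
    """Hypothetical sketch of one RT-AL round: fit a regression tree on the
    current labeled set, route the unlabeled pool to its leaves, and spread
    the labeling budget across leaves (here proportionally to leaf size,
    with uniformly random picks inside each leaf)."""
    if rng is None:
        rng = np.random.default_rng(0)
    tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X_labeled, y_labeled)
    leaf_of = tree.apply(X_pool)                 # leaf id of each pool point
    leaves, counts = np.unique(leaf_of, return_counts=True)
    # Proportional allocation of the budget over leaves (at least one each).
    alloc = np.maximum(1, np.round(batch_size * counts / counts.sum())).astype(int)
    chosen = []
    for leaf, k in zip(leaves, alloc):
        idx = np.flatnonzero(leaf_of == leaf)
        chosen.extend(rng.choice(idx, size=min(k, len(idx)), replace=False))
    return np.array(chosen[:batch_size])         # indices into X_pool
```

The returned indices would then be labeled, appended to the labeled set, and the tree refit for the next round.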

Appendix 2: Initialisation

In Table 3, we compare, using the RMSE for 100 labeled samples (60 for the orange dataset, due to its smaller size), the performance of AL methods that require an initial set of labeled samples, when the initial samples are selected by random sampling (marked (RS) after the method name) and when they are selected smartly (shown in bold): with iRDM for the Airfoil dataset and with GSx for the Diabetes, Boston and Orange datasets. The datasets from the main paper for which RS was the best initialiser are not shown here. The table confirms that initialising the first model smartly helps in building a better model, as the RMSEs in this case are lower than with RS initialisation. The first set of samples should therefore be chosen smartly, as it eventually contributes to training a better model.
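A GSx-style smart initialisation can be sketched as follows. This is an assumed reading of the greedy feature-space sampling of Wu et al. (2019): start from the sample closest to the centroid, then repeatedly add the pool point farthest from the already-selected set; `gsx_init` is a hypothetical name.

```python
import numpy as np

def gsx_init(X, n_init):
    """Greedy feature-space initialisation (GSx-style sketch): maximise the
    minimum distance of each new pick to the points selected so far."""
    centroid = X.mean(axis=0)
    selected = [int(np.argmin(np.linalg.norm(X - centroid, axis=1)))]
    while len(selected) < n_init:
        # Distance of every point to its nearest already-selected point.
        d = np.min(
            np.linalg.norm(X[:, None, :] - X[selected][None, :, :], axis=2),
            axis=1,
        )
        d[selected] = -np.inf                    # never re-pick a selected point
        selected.append(int(np.argmax(d)))
    return np.array(selected)
```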

Table 3 Performance for 100 labeled samples (60 for the orange dataset, due to its smaller size), measured by RMSE (averaged over 200 repetitions), with and without smart initialisation

Appendix 3: Hyperparameters for the final model: random forest

The two hyperparameters of interest in the final regressor that we use, Random Forest, are the depth of the trees (controlled by the minimum number of samples required to form a leaf) and the number of trees in the forest. We set the minimum number of samples per leaf to 3 (the scikit-learn default is 1) to avoid over-fitting. Further, we take the number of trees in the forest to be 100, as this is a good estimate of the point where the performance of the different methods we test converges throughout the prediction phase (as shown in Figs. 8 and 9 for the yacht dataset). Although hyperparameter optimisation for the final regressor is important in general, it depends on the samples selected, and hence on the sampling method used, so it could be unfair to compare different methods with different hyperparameter optimisations. We therefore chose 100 trees, a safe hyperparameter for all the methods.
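The settings above translate directly into scikit-learn; the synthetic data here is only a stand-in for the paper's benchmark datasets.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (the paper uses real benchmark datasets).
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

# Final regressor with the settings described above: 100 trees, and a
# minimum of 3 samples per leaf (scikit-learn's default is 1).
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3, random_state=0)
rf.fit(X[:150], y[:150])
preds = rf.predict(X[150:])
```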

Fig. 8: The performance in prediction using RMSE averaged over 200 runs for different stages in the training using passive and active learning methods for the Yacht dataset, as a function of the number of trees in the Random Forest

Fig. 9: The performance in prediction using RMSE averaged over 200 runs for different stages in the training using passive and active learning methods for the Yacht dataset, as a function of the number of trees in the Random Forest

Appendix 4: Plots for Sect. 4.2

In this section, we present the two-component PCA plots and response histograms for all the real datasets studied in Sect. 4.2 of the main paper (Figs. 10, 11, 12, 13, 14 and 15). We also show the prediction performance of our method and compare it with the state-of-the-art using boxplots, which highlight the differences in RMSE, as well as in the variance and the number of outliers.

Fig. 10: Two-component PCA and histogram of the response of the Superconductivity dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 200

Fig. 11: Two-component PCA and histogram of the response of the Airfoil dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 200

Fig. 12: Two-component PCA and histogram of the response of the Boston Housing dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 200

Fig. 13: Two-component PCA and histogram of the response of the Diabetes dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 200

Fig. 14: Two-component PCA and histogram of the response of the Orange juice dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 60

Fig. 15: Two-component PCA and histogram of the response of the Yacht hydrodynamics dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 140

Appendix 5: Histograms of difference in RMSEs

In this section, we show the histograms of the error differences between our method and the state-of-the-art, to give a visual understanding of Table 2 of the main paper. These histograms confirm that our method is consistently a good performer on all the datasets. We also highlight that, in the cases where other approaches give results equivalent to ours, the values of the differences in errors are very low and thus not statistically significant (Figs. 16, 17, 18, 19, 20 and 21). This can be seen from the peak at 0 on the x-axis, suggesting that in many such cases other methods may outperform ours, but not in a statistically significant way.

Fig. 16: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the airfoil dataset, for 100 labeled samples

Fig. 17: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the yacht dataset, for 100 labeled samples

Fig. 18: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the diabetes dataset, for 100 labeled samples

Fig. 19: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the boston housing dataset, for 100 labeled samples

Fig. 20: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the Superconductivity dataset, for 100 labeled samples

Fig. 21: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the orange dataset, for 100 labeled samples

Appendix 6: RT-AL with other classes of machine learning models

We use three different models for the final predictor to show how our method can be applied to other model classes. Figure 22 shows the prediction performance (RMSE averaged over 200 runs) at different stages of training, using the model-based AL methods GSy, QBC, EMCM, MT, LCMD and our method RT-AL, on the superconductivity dataset (with N = 21,263 and D = 81). RS was used to initialise all the methods (except GSy, where GSx is used as proposed by its authors). The different models we use for the final predictor, with hyperparameter optimisation at each run, are described below:

  • Linear model Lasso: We use the Lasso model as implemented in scikit-learn, optimising the regularisation parameter alpha and keeping the other hyperparameters at their defaults.

  • Random Forest: We use the random forest (RF) as implemented in scikit-learn, optimising the number of estimators in the forest, the maximum number of features and the minimum number of samples per leaf, and keeping the other hyperparameters at their defaults.

  • Multi Layer Perceptron Regressor: We use the MLPRegressor (MLP) as implemented in scikit-learn, optimising the hidden layer sizes, keeping the learning rate adaptive, setting the maximum number of iterations to 1000, enabling early stopping, and keeping the other hyperparameters at their defaults.

Fig. 22: The performance in prediction using RMSE averaged over 200 runs for different stages in the training using model-based active learning methods for the superconductivity dataset (N = 21,263, D = 81)

From the plots in Fig. 22, we conclude that the performance of each model-based AL method varies with the prediction model. However, our regression tree-based method, RT-AL, remains well positioned relative to the other model-based AL algorithms irrespective of the prediction model, implying that RT-AL can indeed be applied to other model classes. Moreover, using random forest as the prediction model leads to the best performance among these three classes of models according to the RMSE score, justifying our choice of final regressor for comparing all the methods. Our method based on regression trees is thus both efficient and robust.

Appendix 7: Computation time

Table 4 shows the time (in seconds) taken to label 100 samples for all the datasets (60 samples for the orange dataset). Note that for our method, the methods corresponding to the italicised times were used as initialisers for the respective datasets, while for the other model-based methods, RS is the initialiser. Our method is indeed very competitive and most often takes the lowest time, especially among the model-based AL methods. RT-AL is far more efficient than QBC (with trees as the committee models), which gave a performance close to ours on some datasets. We also note that in settings where labeling is expensive, so that only few samples can be used to construct the training set, having an accurate model is the priority and model complexity generally plays a smaller role.

Table 4 Mean computational time (in seconds) to label 100 samples (60 for the orange dataset), over a series of 50 runs, for each dataset by column (N and D denote the total number of samples in the dataset and its dimension, respectively). The respective variances are reported in brackets

Appendix 8: Illustration on a simulated dataset

Willett et al. (2005) showed that AL is not always beneficial in regression; we try to answer why, and to identify where it is in fact essential. We argue here that datasets in which the response and the features have different distributions are the ones where AL is most beneficial, and indeed imperative when labeling is expensive. With this idea in mind, we illustrate the importance of our method on a simulation generated specifically so that the features and the response have different structures (Fig. 23). As most methods focus on diversity in the features, we start our illustration with a simulation so as to control the link between the response and the features. We construct a D-dimensional feature space of N samples consisting of c clusters. The response is then defined as a non-linear function of the first principal component of the features, so as to keep only a weak relation between the two distributions.
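The construction above can be sketched as follows; `make_weakly_linked` is a hypothetical name, and the use of `sin` as the non-linearity is our assumption, since the exact generating function is not specified here.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

def make_weakly_linked(N=3000, D=15, c=10, seed=0):
    """c Gaussian clusters in a D-dimensional feature space, with a response
    depending on the features only through the first principal component.
    The sin non-linearity is an assumed stand-in for the unspecified function."""
    X, _ = make_blobs(n_samples=N, n_features=D, centers=c, random_state=seed)
    pc1 = PCA(n_components=1).fit_transform(X).ravel()
    y = np.sin(pc1)
    return X, y
```

Because y depends on X only through one direction, feature-space diversity criteria spread the budget over structure that is irrelevant to the response, which is exactly the failure mode the experiment below exhibits.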

Fig. 23: Two-component PCA and histogram of the response of the generated dataset

In Fig. 24, we compare our approach to RS, GSx and iRDM on a simulation with \(N = 3000\), \(D = 15\) and \(c = 10\) clusters. As expected, GSx and iRDM both underperform compared to RS: at every step, these methods add information about the features while missing critical details of the structure of the response. The comparison thus shows that for data with a weak correlation between the features and the response, picking the points to be labeled without any prior information (using RS) is better than adding details about the features, because the latter focuses on the wrong subset of points.

Fig. 24: Performance in prediction using boxplots over 200 runs when the training set is constructed using passive/model-free AL or RT-AL (grey and white) on a generated dataset, using the query criteria mentioned in parenthesis for RT-AL. The training set size varies from 20 to 200

However, adding information about the response through our regression trees is indeed helpful. It is clear from the figure that, for all the methods in question, we observe a significant improvement when the points to be labeled are picked using a regression tree built on the samples that are already labeled. Further, we also note that GSx + RT (RS) and iRDM + RT (RS) perform better on such datasets than GSx + RT (Diversity-based) and iRDM + RT (Representativity-based): since GSx and iRDM do not work well on such datasets to begin with, the knowledge they add in each leaf is also harmful. Thus, for datasets such as these, where AL schemes are expected to be most interesting, we show that learning with simple regression trees alone reduces the size of the training set to be labeled by a good margin, for a desired level of accuracy. For example, Fig. 24 shows that, on average, the performance with 120 samples well selected by our trees matches the performance with 140 samples selected uniformly, or 180 samples selected by iRDM, and many more when selecting with GSx.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jose, A., de Mendonça, J.P.A., Devijver, E. et al. Regression tree-based active learning. Data Min Knowl Disc 38, 420–460 (2024). https://doi.org/10.1007/s10618-023-00951-7
