Regression tree-based active learning

Published in: Data Mining and Knowledge Discovery

Abstract

Machine learning algorithms often require large training sets to perform well, but labeling such large amounts of data is not always feasible: in many applications, substantial human effort and material cost are needed. Finding effective ways to reduce the size of training sets while maintaining the same performance is then crucial: one wants to choose the best sample of fixed size to be labeled among a given population, aiming at an accurate prediction of the response. This challenge has been studied in detail in classification, but not deeply enough in regression, which is known to be a more difficult task for active learning despite its practical importance. The few model-free active learning methods that select the new samples to be labeled using unlabeled data alone lack information about the conditional distribution of the response given the features. In this paper, we propose a standard regression tree-based active learning method for regression that improves significantly upon existing active learning approaches. It provides impressive results for small and large training sets and an appreciably low variance over several runs. We also exploit model-free approaches and adapt them to our algorithm to utilize maximum information. Through experiments on numerous benchmark datasets, we demonstrate that our framework improves on existing methods and is effective in learning a regression model from a very limited labeled dataset, reducing the sample size needed for a fixed level of performance, even with many features.


Notes

  1. These results were given for purely random Mondrian trees. Here, we use a regression tree that contains the knowledge of the training set, thus improving the structure of the tree.

  2. https://archive.ics.uci.edu/.

  3. For a deeper understanding, we illustrate our method on a generated multimodal dataset in Appendix 8 as well.

  4. https://archive.ics.uci.edu/ml/datasets/superconductivty+data/.

References

  • Burbidge R, Rowland JJ, King RD (2007) Active learning for regression based on query by committee. In: Yin H, Tino P, Corchado E, Byrne W, Yao X (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2007. Springer, Berlin, pp 209–218


  • Cai W, Zhang M, Zhang Y (2017) Batch mode active learning for regression with expected model change. IEEE Trans Neural Networks Learn Syst 28(7):1668–1681


  • Cai W, Zhang Y, Zhou J (2013) Maximizing expected model change for active learning in regression. In: 2013 IEEE 13th international conference on data mining, pp 51–60

  • Chan NN (1982) A-optimality for regression designs. J Math Anal Appl 87(1):45–50


  • Chaudhuri K, Jain P, Natarajan N (2017) Active heteroscedastic regression. In: Proceedings of the 34th international conference on machine learning. Proceedings of machine learning research, vol 70, pp 694–702

  • Chauvet G, Tillé Y (2006) A fast algorithm for balanced sampling. Comput Stat 21(1):53–62


  • Cohn D, Ghahramani Z, Jordan M (1994) Active learning with statistical models. In: Advances in neural information processing systems, vol 7

  • Goetz J, Tewari A, Zimmerman P (2018) Active learning for non-parametric regression using purely random trees. In: Advances in neural information processing systems, vol 31

  • Hazan E, Karnin Z (2014) Hard-margin active linear regression. In: Xing EP, Jebara T (eds) Proceedings of the 31st international conference on machine learning. Proceedings of Machine Learning Research, vol 32. PMLR, Bejing, pp 883–891

  • Holzmüller D, Zaverkin V, Kästner J, Steinwart I (2023) A framework and benchmark for deep batch active learning for regression

  • John RCS, Draper NR (1975) D-optimality for regression designs: a review. Technometrics 17(1):15–23


  • Kaur H, Kaur H, Sharma A (2021) A review of recent advancement in superconductors. Mater Today Proc 37:3612–3614


  • Lakshminarayanan B, Roy DM, Teh YW (2014) Mondrian forests: Efficient online random forests. Adv Neural Inf Process Syst 27:1


  • Liu Z, Jiang X, Luo H, Fang W, Liu J, Wu D (2021) Pool-based unsupervised active learning for regression using iterative representativeness-diversity maximization. Pattern Recognit Lett 142:11–19


  • Luo Z, Hauskrecht M (2019) Region-based active learning with hierarchical and adaptive region construction, pp 441–449

  • O’Neill J, Jane Delany S, MacNamee B (2017) Model-free and model-based active learning for regression. Advances in computational intelligence systems. Springer, Cham, pp 375–386


  • Polyzos KD, Lu Q, Giannakis GB (2022) Weighted ensembles for active learning with adaptivity

  • Pukelsheim F (2006) Optimal design of experiments (classics in applied mathematics). Society for Industrial and Applied Mathematics, USA


  • Riis C, Antunes F, Hüttel FB, Azevedo CL, Pereira FC (2023) Bayesian active learning with fully Bayesian gaussian processes

  • Sabato S, Munos R (2014) Active regression by stratification. In: Proceedings of the 27th international conference on neural information processing systems—Volume 1. NIPS’14. MIT Press, Cambridge, pp 469–477

  • Willett R, Nowak R, Castro R (2005) Faster rates in regression via active learning. In: Advances in neural information processing systems, vol 18

  • Woods DC, Lewis SM, Eccleston JA, Russell KG (2006) Designs for generalized linear models with several variables and model uncertainty. Technometrics 48(2):284–292


  • Wu D (2019) Pool-based sequential active learning for regression. IEEE Trans Neural Networks Learn Syst 30(5):1348–1359


  • Wu D, Lin C-T, Huang J (2019) Active learning for regression using greedy sampling. Inf Sci 474:90–105


  • Xue Y, Hauskrecht M (2019) Active learning of multi-class classification models from ordered class sets. Proc AAAI Conf Artif Intell 33(01):5589–5596


  • Xue Y, Hauskrecht M (2017) Active learning of classification models with Likert-scale feedback. In: Proceedings of the SIAM international conference on data mining, pp 28–35

  • Yang M, Biedermann S, Tang E (2013) On optimal designs for nonlinear models: a general and efficient algorithm. J Am Stat Assoc 108(504):1411–1420


  • Yu H, Kim S (2010) Passive sampling for regression. In: 2010 IEEE international conference on data mining, pp 1151–1156

  • Zhang H, Ravi SS, Davidson I (2020) A graph-based approach for active learning in regression, pp 280–288

  • Zhao J, Sun S, Wang H, Cao Z (2020) Promoting active learning with mixtures of gaussian processes. Knowl-Based Syst 188:105044



Funding

This work has been partially supported by MIAI@Grenoble Alpes (ANR-19-P3IA-0003).

Author information


Corresponding author

Correspondence to Ashna Jose.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

The code is available at https://github.com/AshnaJose/Regression-Tree-based-Active-Learning along with the datasets.

Additional information

Responsible editors: Tania Cerquitelli and Charalampos Tsourakakis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: RT-AL

Figure 7 gives an overview of the RT-AL algorithm.

Fig. 7: Flowchart giving an overview of regression tree-based active learning (RT-AL)
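The core loop of the flowchart can be sketched as follows. This is an illustrative reconstruction, not the paper's exact procedure: the function name `rt_al_query`, the proportional budget allocation, and the random within-leaf picks are our assumptions (the paper also studies diversity- and representativity-based within-leaf criteria).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def rt_al_query(X_labeled, y_labeled, X_pool, batch_size, rng=None):
    """Hypothetical sketch of one RT-AL round: fit a regression tree on the
    current labeled set, route the unlabeled pool to its leaves, and spread
    the labeling budget across leaves (here proportionally to leaf size,
    with uniformly random picks inside each leaf)."""
    if rng is None:
        rng = np.random.default_rng(0)
    tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X_labeled, y_labeled)
    leaf_of = tree.apply(X_pool)                 # leaf id of each pool point
    leaves, counts = np.unique(leaf_of, return_counts=True)
    # Proportional allocation of the budget over leaves (at least one each).
    alloc = np.maximum(1, np.round(batch_size * counts / counts.sum())).astype(int)
    chosen = []
    for leaf, k in zip(leaves, alloc):
        idx = np.flatnonzero(leaf_of == leaf)
        chosen.extend(rng.choice(idx, size=min(k, len(idx)), replace=False))
    return np.array(chosen[:batch_size])         # indices into X_pool
```

The returned indices would then be labeled, appended to the labeled set, and the tree refit for the next round.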

Appendix 2: Initialisation

In Table 3, we compare, using the RMSE for 100 labeled samples (60 for the orange dataset, due to its smaller size), the performance of AL methods that require an initial set of labeled samples, when the initial samples are selected by random sampling (marked (RS) after the method name) and when they are selected smartly (shown in bold): with iRDM for the Airfoil dataset and with GSx for the Diabetes, Boston and Orange datasets. The datasets from the main paper for which RS was the best initialiser are not shown here. The table confirms that initialising the first model smartly helps in building a better model, as the RMSEs in this case are lower than with RS initialisation. The first set of samples should therefore be chosen smartly, as it eventually contributes to training a better model.
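A GSx-style smart initialisation can be sketched as follows. This is an assumed reading of the greedy feature-space sampling of Wu et al. (2019): start from the sample closest to the centroid, then repeatedly add the pool point farthest from the already-selected set; `gsx_init` is a hypothetical name.

```python
import numpy as np

def gsx_init(X, n_init):
    """Greedy feature-space initialisation (GSx-style sketch): maximise the
    minimum distance of each new pick to the points selected so far."""
    centroid = X.mean(axis=0)
    selected = [int(np.argmin(np.linalg.norm(X - centroid, axis=1)))]
    while len(selected) < n_init:
        # Distance of every point to its nearest already-selected point.
        d = np.min(
            np.linalg.norm(X[:, None, :] - X[selected][None, :, :], axis=2),
            axis=1,
        )
        d[selected] = -np.inf                    # never re-pick a selected point
        selected.append(int(np.argmax(d)))
    return np.array(selected)
```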

Table 3 Performance for 100 labeled samples (60 for the orange dataset, due to its smaller size), measured by RMSE (averaged over 200 repetitions), with and without smart initialisation

Appendix 3: Hyperparameters for the final model: random forest

The two hyperparameters of interest in the final regressor that we use, Random Forest, are the depth of the trees (controlled by the minimum number of samples required to form a leaf) and the number of trees in the forest. We set the minimum number of samples per leaf to 3 (the scikit-learn default is 1) to avoid over-fitting. Further, we take the number of trees in the forest to be 100, as this is a good estimate of the point where the performance of the different methods we test converges throughout the prediction phase (as shown in Figs. 8 and 9 for the yacht dataset). Although hyperparameter optimisation for the final regressor is important in general, it depends on the samples selected, and hence on the sampling method used, so it could be unfair to compare different methods with different hyperparameter optimisations. We therefore chose 100 trees, a safe hyperparameter for all the methods.
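The settings above translate directly into scikit-learn; the synthetic data here is only a stand-in for the paper's benchmark datasets.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (the paper uses real benchmark datasets).
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

# Final regressor with the settings described above: 100 trees, and a
# minimum of 3 samples per leaf (scikit-learn's default is 1).
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3, random_state=0)
rf.fit(X[:150], y[:150])
preds = rf.predict(X[150:])
```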

Fig. 8: The performance in prediction using RMSE averaged over 200 runs for different stages in the training using passive and active learning methods for the Yacht dataset, as a function of the number of trees in the Random Forest

Fig. 9: The performance in prediction using RMSE averaged over 200 runs for different stages in the training using passive and active learning methods for the Yacht dataset, as a function of the number of trees in the Random Forest

Appendix 4: Plots for Sect. 4.2

In this section, we present the two-component PCA plots and response histograms for all the real datasets studied in Sect. 4.2 of the main paper (Figs. 10, 11, 12, 13, 14 and 15). We also show the prediction performance of our method and compare it with the state-of-the-art using boxplots, which highlight the differences in RMSE, as well as in the variance and the number of outliers.

Fig. 10: Two-component PCA and histogram of the response of the Superconductivity dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 200

Fig. 11: Two-component PCA and histogram of the response of the Airfoil dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 200

Fig. 12: Two-component PCA and histogram of the response of the Boston Housing dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 200

Fig. 13: Two-component PCA and histogram of the response of the Diabetes dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 200

Fig. 14: Two-component PCA and histogram of the response of the Orange juice dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 60

Fig. 15: Two-component PCA and histogram of the response of the Yacht hydrodynamics dataset. Performance in prediction using boxplots over 200 runs when the training set is constructed using passive and AL approaches. The training set size varies from 20 to 140

Appendix 5: Histograms of difference in RMSEs

In this section, we show the histograms of the error differences between our method and the state-of-the-art, to give a visual understanding of Table 2 of the main paper. These histograms confirm that our method is consistently a good performer on all the datasets. We also highlight that, in the cases where other approaches give results equivalent to ours, the values of the differences in errors are very low and thus not statistically significant (Figs. 16, 17, 18, 19, 20 and 21). This can be seen from the peak at 0 on the x-axis, suggesting that in many such cases other methods may outperform ours, but not in a statistically significant way.

Fig. 16: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the airfoil dataset, for 100 labeled samples

Fig. 17: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the yacht dataset, for 100 labeled samples

Fig. 18: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the diabetes dataset, for 100 labeled samples

Fig. 19: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the boston housing dataset, for 100 labeled samples

Fig. 20: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the Superconductivity dataset, for 100 labeled samples

Fig. 21: Histograms depicting the difference in RMSE (E) of our approach compared to the state-of-the-art over a series of 200 experiments for the orange dataset, for 100 labeled samples

Appendix 6: RT-AL with other classes of machine learning models

We use three different models for the final predictor to show how our method can be applied to other model classes. Figure 22 shows the prediction performance (RMSE averaged over 200 runs) at different stages of training, using the model-based AL methods GSy, QBC, EMCM, MT, LCMD and our method RT-AL, on the superconductivity dataset (with N = 21,263 and D = 81). RS was used to initialise all the methods (except GSy, where GSx is used as proposed by its authors). The different models we use for the final predictor, with hyperparameter optimisation at each run, are described below:

  • Linear model Lasso: We use the Lasso model as implemented in scikit-learn, optimising the regularisation parameter alpha and keeping the other hyperparameters at their defaults.

  • Random Forest: We use the random forest (RF) as implemented in scikit-learn, optimising the number of estimators in the forest, the maximum number of features and the minimum number of samples per leaf, and keeping the other hyperparameters at their defaults.

  • Multi Layer Perceptron Regressor: We use the MLPRegressor (MLP) as implemented in scikit-learn, optimising the hidden layer sizes, keeping the learning rate adaptive, setting the maximum number of iterations to 1000, enabling early stopping, and keeping the other hyperparameters at their defaults.

Fig. 22: The performance in prediction using RMSE averaged over 200 runs for different stages in the training using model-based active learning methods for the superconductivity dataset (N = 21,263, D = 81)

From the plots in Fig. 22, we conclude that the performance of each model-based AL method varies with the prediction model. However, our regression tree-based method, RT-AL, remains well positioned relative to the other model-based AL algorithms irrespective of the prediction model, implying that RT-AL can indeed be applied to other model classes. Moreover, using random forest as the prediction model leads to the best performance among these three classes of models according to the RMSE score, justifying our choice of final regressor for comparing all the methods. Our method based on regression trees is thus both efficient and robust.

Appendix 7: Computation time

Table 4 shows the time (in seconds) taken to label 100 samples for all the datasets (60 samples for the orange dataset). Note that for our method, the methods corresponding to the italicised times were used as initialisers for the respective datasets, while for the other model-based methods, RS is the initialiser. Our method is indeed very competitive and most often takes the lowest time, especially among the model-based AL methods. RT-AL is far more efficient than QBC (with trees as the committee models), which gave a performance close to ours on some datasets. We also note that in settings where labeling is expensive, so that only few samples can be used to construct the training set, having an accurate model is the priority and model complexity generally plays a smaller role.

Table 4 Mean computational time (in seconds) to label 100 samples (60 for the orange dataset), over a series of 50 runs, for each dataset by column (N and D denote the total number of samples in the dataset and its dimension, respectively). The respective variances are reported in brackets

Appendix 8: Illustration on a simulated dataset

Willett et al. (2005) showed that AL is not always beneficial in regression; we try to answer why, and to identify where it is in fact essential. We argue here that datasets in which the response and the features have different distributions are the ones where AL is most beneficial, and indeed imperative when labeling is expensive. With this idea in mind, we illustrate the importance of our method on a simulation generated specifically so that the features and the response have different structures (Fig. 23). As most methods focus on diversity in the features, we start our illustration with a simulation so as to control the link between the response and the features. We construct a D-dimensional feature space of N samples consisting of c clusters. The response is then defined as a non-linear function of the first principal component of the features, so as to keep only a weak relation between the two distributions.
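The construction above can be sketched as follows; `make_weakly_linked` is a hypothetical name, and the use of `sin` as the non-linearity is our assumption, since the exact generating function is not specified here.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

def make_weakly_linked(N=3000, D=15, c=10, seed=0):
    """c Gaussian clusters in a D-dimensional feature space, with a response
    depending on the features only through the first principal component.
    The sin non-linearity is an assumed stand-in for the unspecified function."""
    X, _ = make_blobs(n_samples=N, n_features=D, centers=c, random_state=seed)
    pc1 = PCA(n_components=1).fit_transform(X).ravel()
    y = np.sin(pc1)
    return X, y
```

Because y depends on X only through one direction, feature-space diversity criteria spread the budget over structure that is irrelevant to the response, which is exactly the failure mode the experiment below exhibits.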

Fig. 23: Two-component PCA and histogram of the response of the generated dataset

In Fig. 24, we compare our approach to RS, GSx and iRDM on a simulation with \(N = 3000\), \(D = 15\) and \(c = 10\) clusters. As expected, GSx and iRDM both underperform compared to RS: at every step, these methods add information about the features while missing critical details of the structure of the response. The comparison thus shows that for data with a weak correlation between the features and the response, picking the points to be labeled without any prior information (using RS) is better than adding details about the features, because the latter focuses on the wrong subset of points.

Fig. 24: Performance in prediction using boxplots over 200 runs when the training set is constructed using passive/model-free AL or RT-AL (grey and white) on a generated dataset, using the query criteria mentioned in parenthesis for RT-AL. The training set size varies from 20 to 200

However, adding information about the response through our regression trees is indeed helpful. It is clear from the figure that, for all the methods in question, we observe a significant improvement when the points to be labeled are picked using a regression tree built on the samples that are already labeled. Further, we also note that GSx + RT (RS) and iRDM + RT (RS) perform better on such datasets than GSx + RT (Diversity-based) and iRDM + RT (Representativity-based): since GSx and iRDM do not work well on such datasets to begin with, the knowledge they add in each leaf is also harmful. Thus, for datasets such as these, where AL schemes are expected to be most interesting, we show that learning with simple regression trees alone reduces the size of the training set to be labeled by a good margin, for a desired level of accuracy. For example, Fig. 24 shows that, on average, the performance with 120 samples well selected by our trees matches the performance with 140 samples selected uniformly, or 180 samples selected by iRDM, and many more when selecting with GSx.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jose, A., de Mendonça, J.P.A., Devijver, E. et al. Regression tree-based active learning. Data Min Knowl Disc 38, 420–460 (2024). https://doi.org/10.1007/s10618-023-00951-7
