Abstract
Machine learning algorithms often require large training sets to perform well, but labeling such large amounts of data is not always feasible, as substantial human effort and material cost are needed in many applications. Finding effective ways to reduce the size of training sets while maintaining the same performance is therefore crucial: one wants to choose the best sample of fixed size to be labeled among a given population, aiming at an accurate prediction of the response. This challenge has been studied in detail for classification, but not as deeply for regression, which is known to be a harder task for active learning despite its practical importance. A few model-free active learning methods have been proposed that select the new samples to be labeled using only unlabeled data, but they lack information on the conditional distribution of the response given the features. In this paper, we propose a standard regression-tree-based active learning method for regression that improves significantly upon existing active learning approaches. It provides impressive results for small and large training sets and an appreciably low variance across several runs. We also exploit model-free approaches, adapting them to our algorithm so as to use the maximum available information. Through experiments on numerous benchmark datasets, we demonstrate that our framework improves on existing methods and is effective in learning a regression model from a very limited labeled dataset, reducing the sample size needed for a fixed level of performance, even with many features.
Notes
These results were given for purely random Mondrian trees. Here, we use a regression tree that incorporates knowledge of the training set, thus improving the structure of the tree.
For a deeper understanding, we illustrate our method on a generated multimodal dataset in Appendix 8 as well.
References
Burbidge R, Rowland JJ, King RD (2007) Active learning for regression based on query by committee. In: Yin H, Tino P, Corchado E, Byrne W, Yao X (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2007. Springer, Berlin, pp 209–218
Cai W, Zhang M, Zhang Y (2017) Batch mode active learning for regression with expected model change. IEEE Trans Neural Networks Learn Syst 28(7):1668–1681
Cai W, Zhang Y, Zhou J (2013) Maximizing expected model change for active learning in regression. In: 2013 IEEE 13th international conference on data mining, pp 51–60
Chan NN (1982) A-optimality for regression designs. J Math Anal Appl 87(1):45–50
Chaudhuri K, Jain P, Natarajan N (2017) Active heteroscedastic regression. In: Proceedings of the 34th international conference on machine learning. Proceedings of machine learning research, vol 70, pp 694–702
Chauvet G, Tillé Y (2006) A fast algorithm for balanced sampling. Comput Stat 21(1):53–62
Cohn D, Ghahramani Z, Jordan M (1994) Active learning with statistical models. In: Advances in neural information processing systems, vol 7
Goetz J, Tewari A, Zimmerman P (2018) Active learning for non-parametric regression using purely random trees. In: Advances in neural information processing systems, vol 31
Hazan E, Karnin Z (2014) Hard-margin active linear regression. In: Xing EP, Jebara T (eds) Proceedings of the 31st international conference on machine learning. Proceedings of Machine Learning Research, vol 32. PMLR, Beijing, pp 883–891
Holzmüller D, Zaverkin V, Kästner J, Steinwart I (2023) A framework and benchmark for deep batch active learning for regression
John RCS, Draper NR (1975) D-optimality for regression designs: a review. Technometrics 17(1):15–23
Kaur H, Kaur H, Sharma A (2021) A review of recent advancement in superconductors. Mater Today Proc 37:3612–3614
Lakshminarayanan B, Roy DM, Teh YW (2014) Mondrian forests: Efficient online random forests. Adv Neural Inf Process Syst 27:1
Liu Z, Jiang X, Luo H, Fang W, Liu J, Wu D (2021) Pool-based unsupervised active learning for regression using iterative representativeness-diversity maximization. Pattern Recognit Lett 142:11–19
Luo Z, Hauskrecht M (2019) Region-based active learning with hierarchical and adaptive region construction, pp 441–449
O’Neill J, Jane Delany S, MacNamee B (2017) Model-free and model-based active learning for regression. Advances in computational intelligence systems. Springer, Cham, pp 375–386
Polyzos KD, Lu Q, Giannakis GB (2022) Weighted ensembles for active learning with adaptivity
Pukelsheim F (2006) Optimal design of experiments (classics in applied mathematics). Society for Industrial and Applied Mathematics, USA
Riis C, Antunes F, Hüttel FB, Azevedo CL, Pereira FC (2023) Bayesian active learning with fully Bayesian gaussian processes
Sabato S, Munos R (2014) Active regression by stratification. In: Proceedings of the 27th international conference on neural information processing systems—Volume 1. NIPS’14. MIT Press, Cambridge, pp 469–477
Willett R, Nowak R, Castro R (2005) Faster rates in regression via active learning. In: Advances in neural information processing systems, vol 18
Woods DC, Lewis SM, Eccleston JA, Russell KG (2006) Designs for generalized linear models with several variables and model uncertainty. Technometrics 48(2):284–292
Wu D (2019) Pool-based sequential active learning for regression. IEEE Trans Neural Networks Learn Syst 30(5):1348–1359
Wu D, Lin C-T, Huang J (2019) Active learning for regression using greedy sampling. Inf Sci 474:90–105
Xue Y, Hauskrecht M (2019) Active learning of multi-class classification models from ordered class sets. Proc AAAI Conf Artif Intell 33(01):5589–5596
Xue Y, Hauskrecht M (2017) Active learning of classification models with Likert-scale feedback. In: Proceedings of the SIAM international conference on data mining, pp 28–35
Yang M, Biedermann S, Tang E (2013) On optimal designs for nonlinear models: a general and efficient algorithm. J Am Stat Assoc 108(504):1411–1420
Yu H, Kim S (2010) Passive sampling for regression. In: 2010 IEEE international conference on data mining, pp 1151–1156
Zhang H, Ravi SS, Davidson I (2020) A graph-based approach for active learning in regression, pp 280–288
Zhao J, Sun S, Wang H, Cao Z (2020) Promoting active learning with mixtures of gaussian processes. Knowl-Based Syst 188:105044
Funding
This work has been partially supported by MIAI@Grenoble Alpes (ANR-19-P31A-0003).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Code availability
The code is available at https://github.com/AshnaJose/Regression-Tree-based-Active-Learning along with the datasets.
Additional information
Responsible editors: Tania Cerquitelli and Charalampos Tsourakakis.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: RT-AL
Figure 7 gives an overview of the RT-AL algorithm.
Appendix 2: Initialisation
In Table 3, we compare, using the RMSE for 100 labeled samples (except for the orange dataset, where we consider only 60 labeled samples due to its smaller size), the performance of AL methods that require an initial set of labeled samples, when the initial samples are selected by random sampling (indicated by (RS) after the method name) and when they are selected smartly (shown in bold), using iRDM for the Airfoil dataset and GSx for the Diabetes, Boston and Orange datasets. The datasets in the main paper for which RS was the best initialiser are not shown here. The table shows that initialising the first model smartly indeed helps to build a better model, as the RMSEs in this case are lower than with RS initialisation. Thus, the first set of samples should be chosen smartly, as it eventually contributes to training a better model.
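As a reference point, the GSx initialiser mentioned above can be sketched as follows. This is an illustrative reimplementation of greedy sampling in the feature space (Wu et al. 2019); the function name and details are ours, not taken from the original implementation:

```python
import numpy as np

def gsx_init(X, n_init):
    """Greedy sampling in feature space (GSx): start from the sample
    closest to the centroid, then repeatedly pick the unlabeled point
    whose distance to the closest already-selected point is largest."""
    centroid = X.mean(axis=0)
    first = int(np.argmin(np.linalg.norm(X - centroid, axis=1)))
    selected = [first]
    # d_min[i] = distance from point i to the nearest selected point
    d_min = np.linalg.norm(X - X[first], axis=1)
    while len(selected) < n_init:
        nxt = int(np.argmax(d_min))
        selected.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

Since already-selected points have a minimum distance of zero, each iteration necessarily picks a new point, spreading the initial set across the feature space.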
Appendix 3: Hyperparameters for the final model: random forest
The two parameters of interest in the final regressor that we use, Random Forest, are the depth of the trees (controlled by the minimum number of samples required to form a leaf) and the number of trees in the forest. We set the minimum number of samples per leaf to 3 (the default in scikit-learn is 1) to avoid over-fitting. Further, we take the number of trees in the forest to be 100, as this is a good estimate of the point where the performance of the different methods we test converges throughout the prediction phase (as shown in Figs. 8 and 9 for the yacht dataset). Although hyperparameter optimisation for the final regressor is important in general, it depends on the selected samples, and hence on the sampling method used; it could therefore be unfair to compare different methods with different hyperparameter optimisations. We thus chose 100 trees, which is a safe hyperparameter for all the methods.
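A minimal sketch of this configuration in scikit-learn (the variable name is ours; all other hyperparameters stay at their defaults):

```python
from sklearn.ensemble import RandomForestRegressor

# Final regressor used in the experiments: 100 trees and a minimum of
# 3 samples per leaf (the scikit-learn default is 1) to limit
# over-fitting on small labeled sets.
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3,
                           random_state=0)
```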
Appendix 4: Plots for Sect. 4.2
In this section, we present the two-component PCA plots and response histograms for all the real datasets studied in Sect. 4.2 of the main paper (Figs. 10, 11, 12, 13, 14 and 15). We also show the prediction performance of our method and compare it with the state-of-the-art with the help of boxplots, which highlight the differences in RMSE, as well as in variance and the number of outliers.
Appendix 5: Histograms of difference in RMSEs
In this section, we show histograms of the error differences between our method and the state-of-the-art, to give a visual counterpart to Table 2 of the main paper. These histograms confirm that our method is consistently a good performer on all the datasets. We also highlight that, in cases where other approaches give results equivalent to ours, the values of the error differences are very low and thus not statistically significant (Figs. 16, 17, 18, 19, 20 and 21). This can be seen from the peak at 0 on the x-axis, indicating that in the many cases where other methods appear better than ours, the difference is not statistically significant.
Appendix 6: RT-AL with other classes of machine learning models
We use three different models for the final predictor to show how our method can be applied to other models. Figure 22 shows the prediction performance, using the RMSE averaged over 200 runs at different stages of training, for the model-based AL methods GSy, QBC, EMCM, MT, LCMD and our method RT-AL, on the superconductivity dataset (with N = 21,263 and D = 81). RS was used to initialise all the methods (except GSy, where GSx is used as proposed by the authors). The different models we use for the final predictor, with hyperparameter optimisation in each run, are described below:
-
Linear model Lasso: We use the lasso model as implemented in Scikit-learn, optimising alpha and keeping the other hyperparameters as default.
-
Random Forest: We use the random forest method (RF) as implemented in Scikit-learn, optimising the number of estimators in the forest, the maximum number of features and the minimum number of samples in the leaf, and keeping the other hyperparameters as default.
-
Multi Layer Perceptron Regressor: We use MLPRegressor (MLP) as implemented in Scikit-learn, optimising the hidden layer sizes, keeping the learning rate adaptive, setting the maximum number of iterations to 1000 and early stopping to true, and keeping the other hyperparameters as default.
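The three configurations above can be sketched as scikit-learn grid searches. The parameter grids shown here are illustrative placeholders, not the exact grids used in our experiments:

```python
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

# One grid search per model class; only the listed hyperparameters are
# tuned, everything else stays at the scikit-learn defaults.
searches = {
    "Lasso": GridSearchCV(Lasso(),
                          {"alpha": [0.001, 0.01, 0.1, 1.0]}, cv=3),
    "RF": GridSearchCV(RandomForestRegressor(random_state=0),
                       {"n_estimators": [50, 100],
                        "max_features": [0.5, 1.0],
                        "min_samples_leaf": [1, 3]}, cv=3),
    "MLP": GridSearchCV(MLPRegressor(learning_rate="adaptive",
                                     max_iter=1000,
                                     early_stopping=True,
                                     random_state=0),
                        {"hidden_layer_sizes": [(50,), (100,)]}, cv=3),
}
```

Each search is refit at every AL iteration on the labeled samples collected so far, so the tuned hyperparameters can change as the training set grows.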
From the plots in Fig. 22 we conclude that the performance of each model-based AL method varies with the prediction model. However, our method with regression trees, RT-AL, is well positioned compared to the other model-based AL algorithms irrespective of the prediction model, implying that RT-AL can indeed be applied to other models. Moreover, using a random forest as the prediction model leads to the best RMSE among these three classes of models, justifying our choice of it for comparing all the methods. Our method using regression trees is thus both efficient and robust.
Appendix 7: Computation time
We show in Table 4 the time (in seconds) needed to label 100 samples for all the datasets (except the orange dataset, for which we show the times for 60 labeled samples). Note that for our method, the methods corresponding to the italicised times were used as initialisers for the respective datasets, while for the other model-based methods, RS is the initialiser. We see that our method is very competitive and most often has the lowest times, especially among the model-based AL methods. RT-AL is far more efficient than QBC (with trees as the models in the committee), which gave a performance close to ours on some datasets. We also note that in cases where labeling is expensive, so that only few samples can be used to construct the training set, having an accurate model is the main concern and model complexity generally plays a minor role.
Appendix 8: Illustration on a simulated dataset
Willett et al. (2005) showed that AL is not always beneficial in regression; here we try to answer why, and identify where it is in fact indispensable. We argue that datasets in which the response and the features have different distributions are those where AL is most beneficial, and indeed imperative when labeling is expensive. With this idea in mind, we illustrate the importance of our method on a simulation generated specifically to have different structures for the features and the response (Fig. 23). As most methods focus on diversity in the features, we start with a simulation so as to control the link between the response and the features. We construct a D-dimensional feature space of N samples consisting of c clusters. The response is then defined as a non-linear function of the first principal component of the features, so as to keep only a weak relation between the two distributions.
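One possible way to generate such a dataset is sketched below. This is an illustrative reconstruction: the function name, the use of `make_blobs`, and the sine non-linearity are our assumptions, not the exact construction behind Fig. 23:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

def make_multimodal(n=3000, d=15, c=10, seed=0):
    """Clustered features with a response only weakly tied to the
    feature geometry: y is a non-linear function of the first
    principal component of X, plus a small noise term."""
    X, _ = make_blobs(n_samples=n, n_features=d, centers=c,
                      random_state=seed)
    pc1 = PCA(n_components=1).fit_transform(X).ravel()
    y = np.sin(pc1) + 0.1 * np.random.RandomState(seed).randn(n)
    return X, y
```

Because the response depends on a single projection of the features, diversity in feature space alone tells a sampler little about where the response varies, which is exactly the regime probed in this appendix.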
In Fig. 24, we compare our approach to RS, GSx and iRDM for a simulation with \(N = 3000\), \(D = 15\) and \(c = 10\) clusters. As expected, GSx and iRDM both underperform compared to RS: at every step, these two methods add information about the features while missing critical details of the structure of the response. The comparison thus shows that, for data with a weak correlation between the features and the response, picking the points to be labeled without any prior information at all (using RS) is better than adding details about the features, because the latter focuses on the wrong subset of points.
However, adding information about the response through our regression trees is indeed helpful. It is clear from the figure that, for all the methods in question, we observe a significant improvement when the set of points to be labeled is picked using a regression tree built on the samples already labeled. We further note that GSx + RT (RS) or iRDM + RT (RS) is better for such datasets than GSx + RT (Diversity-based) or iRDM + RT (Representativity-based): since GSx and iRDM do not work well on such datasets to begin with, the knowledge they add in each leaf is also damaging. Thus, for datasets such as these, where AL schemes are expected to be most interesting, we show that learning with simple regression trees alone reduces the size of the training set to be labeled by a good margin, for a desired level of accuracy. For example, Fig. 24 shows that, on average, the performance with 120 samples well selected by our trees equals the performance with 140 samples selected uniformly, 180 samples selected by iRDM, and even more when selecting with GSx.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jose, A., de Mendonça, J.P.A., Devijver, E. et al. Regression tree-based active learning. Data Min Knowl Disc 38, 420–460 (2024). https://doi.org/10.1007/s10618-023-00951-7