Abstract
The latest technological advances drive the emergence of countless real-time data streams fed by users, sensors, and devices. These data sources can be mined with the help of predictive and classification techniques to support decision-making in fields like e-commerce, industry or health. In particular, stream-based classification is widely used to categorise incoming samples on the fly. However, the distribution of samples per class is often imbalanced, affecting the performance and fairness of machine learning models. To overcome this drawback, this paper proposes Bplug, a balancing plug-in for stream-based classification, to minimise the bias introduced by data imbalance. First, the plug-in determines the class imbalance degree and then synthesises data statistically through non-parametric kernel density estimation. The experiments, performed with real data from Wikivoyage and Metro of Porto, show that Bplug maintains inter-feature correlation and improves classification accuracy. Moreover, it works both online and offline.
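The two steps named in the abstract (measuring the class imbalance, then synthesising samples by kernel density estimation) can be illustrated with a minimal sketch. This is not the Bplug implementation: the function names `imbalance_ratio` and `kde_oversample`, the max/min count ratio used as the imbalance degree, the fixed bandwidth, and the choice to top up every class to the majority size are all assumptions made for illustration. Only `sklearn.neighbors.KernelDensity`, which the chapter's notes reference, comes from the source.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def imbalance_ratio(y):
    """Ratio between the largest and smallest class counts (assumed measure)."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()


def kde_oversample(X, y, bandwidth=0.5, random_state=0):
    """Top up every minority class to the majority size with KDE samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for label, count in zip(classes, counts):
        missing = int(target - count)
        if missing == 0:
            continue
        # One multivariate KDE per class, fitted on all features jointly,
        # so that sampled rows follow the joint distribution of that class.
        kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
        kde.fit(X[y == label])
        X_parts.append(kde.sample(n_samples=missing, random_state=random_state))
        y_parts.append(np.full(missing, label))
    return np.vstack(X_parts), np.concatenate(y_parts)
```

Fitting the density on all features jointly, rather than one feature at a time, is one plausible way to keep inter-feature correlation in the synthetic rows. In a streaming setting the same routine could be applied to a sliding window of recent samples, which is again an assumption rather than something the abstract specifies.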
Notes
1. If \(O < 20\,000\), it is advisable that \(n = O/2\), otherwise \(n = O/4\).
2. Available at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html, December 2022.
3. Available at https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html, December 2022.
4. Available from the corresponding author on reasonable request.
5. Available at https://www.wikivoyage.org, December 2022.
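Notes 2 and 3 point to scikit-learn's `GridSearchCV` and `KernelDensity`. The notes do not say which hyperparameters are searched, so the following is only a sketch under the assumption that the kernel bandwidth is tuned by cross-validated log-likelihood. The helper name `tune_bandwidth` and the candidate grid are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity


def tune_bandwidth(X, bandwidths, cv=5):
    """Pick the KDE bandwidth with the best cross-validated log-likelihood."""
    search = GridSearchCV(
        KernelDensity(kernel="gaussian"),
        {"bandwidth": list(bandwidths)},  # assumed search space
        cv=cv,
    )
    # With no explicit scoring, GridSearchCV uses KernelDensity.score,
    # i.e. the total log-likelihood of the held-out fold.
    search.fit(np.asarray(X, dtype=float))
    return search.best_params_["bandwidth"]
```

A call such as `tune_bandwidth(X_minority, [0.1, 0.5, 1.0])` would return one of the candidate values, where `X_minority` stands for the feature rows of the class being oversampled.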
Acknowledgements
This work was partially supported by: (i) Xunta de Galicia grants ED481B-2021-118 and ED481B-2022-093, Spain; and (ii) Portuguese National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) as part of project UIDB/50014/2020 (DOI: 10.54499/UIDP/50014/2020 | https://doi.org/10.54499/UIDP/50014/2020).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
de Arriba-Pérez, F., García-Méndez, S., Leal, F., Malheiro, B., Burguillo-Rial, J.C. (2024). Balancing Plug-In for Stream-Based Classification. In: Rocha, A., Adeli, H., Dzemyda, G., Moreira, F., Colla, V. (eds) Information Systems and Technologies. WorldCIST 2023. Lecture Notes in Networks and Systems, vol 799. Springer, Cham. https://doi.org/10.1007/978-3-031-45642-8_6
DOI: https://doi.org/10.1007/978-3-031-45642-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45641-1
Online ISBN: 978-3-031-45642-8
eBook Packages: Intelligent Technologies and Robotics (R0)