Balancing Plug-In for Stream-Based Classification

de Arriba-Pérez, Francisco; García-Méndez, Silvia; Leal, Fátima; Malheiro, Benedita; Burguillo-Rial, Juan Carlos

doi:10.1007/978-3-031-45642-8_6

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 799)

Included in the following conference series:

World Conference on Information Systems and Technologies

Abstract

The latest technological advances drive the emergence of countless real-time data streams fed by users, sensors, and devices. These data sources can be mined with the help of predictive and classification techniques to support decision-making in fields like e-commerce, industry or health. In particular, stream-based classification is widely used to categorise incoming samples on the fly. However, the distribution of samples per class is often imbalanced, affecting the performance and fairness of machine learning models. To overcome this drawback, this paper proposes Bplug, a balancing plug-in for stream-based classification, to minimise the bias introduced by data imbalance. First, the plug-in determines the class imbalance degree and then synthesises data statistically through non-parametric kernel density estimation. The experiments, performed with real data from Wikivoyage and Metro of Porto, show that Bplug maintains inter-feature correlation and improves classification accuracy. Moreover, it works both online and offline.
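
The two steps named in the abstract (measure the class imbalance, then synthesise minority samples from a kernel density estimate) can be illustrated with a minimal sketch. This is not the Bplug implementation. It assumes a Gaussian kernel with a fixed, hand-picked bandwidth, a simple majority-to-minority count ratio as the imbalance measure, and an offline batch of samples. The function names `imbalance_ratio` and `kde_oversample` are hypothetical. The chapter's notes point to scikit-learn's `KernelDensity` and `GridSearchCV`, whereas this sketch uses only the Python standard library.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def kde_oversample(samples, labels, bandwidth=0.1, seed=0):
    """Top up every minority class to the majority class count.

    Drawing from a Gaussian KDE is equivalent to picking a stored sample
    uniformly at random and adding zero-mean Gaussian noise whose standard
    deviation equals the bandwidth. The bandwidth value is an assumption.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_samples, new_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Real samples of this class act as the kernel centres.
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            base = rng.choice(pool)
            # Perturb all features of one real sample together.
            new_samples.append([v + rng.gauss(0.0, bandwidth) for v in base])
            new_labels.append(cls)
    return new_samples, new_labels


# Toy data (invented for illustration): four samples of class 0, two of class 1.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.3, 0.1], [5.0, 5.1], [5.2, 4.9]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = kde_oversample(X, y)
```

Because each synthetic point is a perturbed copy of a complete real sample, features are drawn jointly rather than one at a time. A stream-based variant would additionally need to update the class counts and the sample pool incrementally, which this sketch does not attempt.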

Notes

1. If \(O<{20\,000}\), it is advisable that \(n=O/2\); otherwise, \(n=O/4\).
2. Available at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html, December 2022.
3. Available at https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html, December 2022.
4. Available from the corresponding author on reasonable request.
5. Available at https://www.wikivoyage.org, December 2022.

Acknowledgements

This work was partially supported by: (i) Xunta de Galicia grants ED481B-2021-118 and ED481B-2022-093, Spain; and (ii) Portuguese National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) as part of project UIDB/50014/2020 (DOI: 10.54499/UIDP/50014/2020 | https://doi.org/10.54499/UIDP/50014/2020).

Author information

Authors and Affiliations

Information Technologies Group, atlanTTic, University of Vigo, Vigo, Spain
Francisco de Arriba-Pérez, Silvia García-Méndez & Juan Carlos Burguillo-Rial
REMIT, Universidade Portucalense, Porto, Portugal
Fátima Leal
INESC TEC, Porto, Portugal
Benedita Malheiro
ISEP, Polytechnic Institute of Porto, Porto, Portugal
Benedita Malheiro

Corresponding author

Correspondence to Benedita Malheiro.

Editor information

Editors and Affiliations

ISEG, Universidade de Lisboa, Lisbon, Cávado, Portugal
Alvaro Rocha
College of Engineering, The Ohio State University, Columbus, OH, USA
Hojjat Adeli
Institute of Data Science and Digital Technologies, Vilnius University, Vilnius, Lithuania
Gintautas Dzemyda
DCT, Universidade Portucalense, Porto, Portugal
Fernando Moreira
TeCIP Institute, Scuola Superiore Sant’Anna, Pisa, Italy
Valentina Colla

Cite this paper

de Arriba-Pérez, F., García-Méndez, S., Leal, F., Malheiro, B., Burguillo-Rial, J.C. (2024). Balancing Plug-In for Stream-Based Classification. In: Rocha, A., Adeli, H., Dzemyda, G., Moreira, F., Colla, V. (eds) Information Systems and Technologies. WorldCIST 2023. Lecture Notes in Networks and Systems, vol 799. Springer, Cham. https://doi.org/10.1007/978-3-031-45642-8_6

DOI: https://doi.org/10.1007/978-3-031-45642-8_6
Published: 16 February 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45641-1
Online ISBN: 978-3-031-45642-8
eBook Packages: Intelligent Technologies and Robotics (R0)
