Balancing Plug-In for Stream-Based Classification

  • Conference paper
Information Systems and Technologies (WorldCIST 2023)

Abstract

The latest technological advances drive the emergence of countless real-time data streams fed by users, sensors, and devices. These data sources can be mined with the help of predictive and classification techniques to support decision-making in fields like e-commerce, industry or health. In particular, stream-based classification is widely used to categorise incoming samples on the fly. However, the distribution of samples per class is often imbalanced, affecting the performance and fairness of machine learning models. To overcome this drawback, this paper proposes Bplug, a balancing plug-in for stream-based classification, to minimise the bias introduced by data imbalance. First, the plug-in determines the class imbalance degree and then synthesises data statistically through non-parametric kernel density estimation. The experiments, performed with real data from Wikivoyage and Metro of Porto, show that Bplug maintains inter-feature correlation and improves classification accuracy. Moreover, it works both online and offline.
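
The two steps named in the abstract (measure the class imbalance, then synthesise samples through kernel density estimation) can be illustrated with a minimal sketch. This is not the authors' Bplug implementation. The function names, the fixed `bandwidth` value and the toy data are assumptions made for illustration. The paper's notes point to scikit-learn's `KernelDensity` and `GridSearchCV`; to stay dependency-free, this sketch instead samples from a Gaussian KDE directly, by picking a stored minority sample at random and adding Gaussian noise.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority class count divided by minority class count (1.0 = balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def kde_sample(points, n_new, bandwidth, rng):
    """Draw n_new points from a Gaussian KDE centred on `points`.

    Sampling from a Gaussian KDE is equivalent to choosing one stored point
    uniformly at random and adding Gaussian noise with std = bandwidth.
    """
    new_points = []
    for _ in range(n_new):
        centre = rng.choice(points)
        new_points.append([x + rng.gauss(0.0, bandwidth) for x in centre])
    return new_points


def balance(X, y, bandwidth=0.1, seed=0):
    """Oversample every class up to the size of the largest class.

    `bandwidth` is a hypothetical fixed value, not a tuned one.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        if count < target:
            members = [x for x, lab in zip(X, y) if lab == label]
            synthetic = kde_sample(members, target - count, bandwidth, rng)
            X_out.extend(synthetic)
            y_out.extend([label] * len(synthetic))
    return X_out, y_out


# Toy data (made up): four samples of class "a", two of class "b".
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [5.0, 5.0], [5.2, 5.1]]
y = ["a", "a", "a", "a", "b", "b"]
X_bal, y_bal = balance(X, y)
```

Because each synthetic point is a whole existing sample plus small noise on every feature, the features move together, which is one simple way to keep inter-feature correlation roughly intact.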

Notes

  1. If \(O<20\,000\), it is advisable that \(n=O/2\); otherwise, \(n=O/4\).

  2. Available at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html, December 2022.

  3. Available at https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html, December 2022.

  4. Available from the corresponding author on reasonable request.

  5. Available at https://www.wikivoyage.org, December 2022.
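
The rule of thumb in note 1 can be written as a small function. What \(O\) and \(n\) denote is defined in the body of the paper, not on this page, so the sketch below treats them as plain numbers; the function name `advised_n` is hypothetical.

```python
def advised_n(O):
    """Rule from note 1: n = O/2 when O < 20 000, otherwise n = O/4."""
    return O / 2 if O < 20_000 else O / 4


# The threshold itself falls in the "otherwise" branch.
below = advised_n(10_000)
at_threshold = advised_n(20_000)
above = advised_n(40_000)
```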

Acknowledgements

This work was partially supported by: (i) Xunta de Galicia grants ED481B-2021-118 and ED481B-2022-093, Spain; and (ii) Portuguese National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) as part of project UIDB/50014/2020 (DOI: 10.54499/UIDP/50014/2020 | https://doi.org/10.54499/UIDP/50014/2020).

Author information

Corresponding author

Correspondence to Benedita Malheiro.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

de Arriba-Pérez, F., García-Méndez, S., Leal, F., Malheiro, B., Burguillo-Rial, J.C. (2024). Balancing Plug-In for Stream-Based Classification. In: Rocha, A., Adeli, H., Dzemyda, G., Moreira, F., Colla, V. (eds) Information Systems and Technologies. WorldCIST 2023. Lecture Notes in Networks and Systems, vol 799. Springer, Cham. https://doi.org/10.1007/978-3-031-45642-8_6
