Load Balancing for Imbalanced Data Sets: Classifying Scientific Artefacts for Evidence Based Medicine

Hassanzadeh, Hamed; Groza, Tudor; Nguyen, Anthony; Hunter, Jane

doi:10.1007/978-3-319-13560-1_84

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8862)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence

Abstract

Data skewness is a challenge encountered in particular when applying supervised machine learning approaches in domains such as healthcare and biomedical information engineering. Evidence Based Medicine (EBM) is a clinical strategy for prescribing treatment to individual patients based on the current best evidence. Clinicians need to query publication repositories to find the best evidence to support their decision-making processes. This sophisticated information is materialised in the form of scientific artefacts in scholarly publications, and the automatic extraction of these artefacts is a technical challenge for current generic search engines. Many classification approaches have been proposed for identifying key scientific artefacts in EBM; however, their performance is affected by the imbalanced nature of the data in this domain. In this paper, we present four data balancing approaches applied in a binary ensemble classifier framework for classifying scientific artefacts in the EBM domain. Our balancing approaches improve the ensemble classifier's F-score by up to 15% for classes of scientific artefacts with extremely low coverage in the domain. In addition, we propose a classifier selection method for choosing the best classifier based on the distributional features of the classes. The resulting classifiers show improved classification performance when compared to state-of-the-art approaches.
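The abstract does not describe the four balancing approaches in detail, so the following is only a minimal sketch of one generic technique, random under-sampling of the majority class in a one-vs-rest (binary) setting, together with the F-score used as the evaluation measure. The class names ("rare", "common"), the helper functions, and the toy data are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def binarise(labels, target):
    """One-vs-rest view: 1 for the target class, 0 for every other class."""
    return [1 if y == target else 0 for y in labels]


def undersample(features, binary_labels, ratio=1.0, seed=0):
    """Randomly keep at most ratio * (#positives) negatives; keep all positives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(binary_labels) if y == 1]
    neg = [i for i, y in enumerate(binary_labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    keep = sorted(pos + keep_neg)
    return [features[i] for i in keep], [binary_labels[i] for i in keep]


def f_score(gold, predicted):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical skewed data set: 90 "common" sentences, 10 "rare" ones.
labels = ["common"] * 90 + ["rare"] * 10
features = list(range(len(labels)))  # placeholder feature vectors

y = binarise(labels, "rare")
bal_x, bal_y = undersample(features, y, ratio=1.0)
balanced_counts = Counter(bal_y)

# A classifier that always predicts the majority class is 90% accurate
# on the skewed data, yet its F-score on the rare class is zero.
always_negative = [0] * len(y)
accuracy = sum(1 for g, p in zip(y, always_negative) if g == p) / len(y)
rare_f = f_score(y, always_negative)
```

In this sketch the balanced training set contains equal numbers of positive and negative examples, and the last lines show why accuracy alone is misleading for low-coverage classes, which is the motivation for reporting F-score.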

Author information

Authors and Affiliations

School of ITEE, The University of Queensland, Australia
Hamed Hassanzadeh, Tudor Groza & Jane Hunter
The Australian e-Health Research Centre, Brisbane, Queensland, Australia
Hamed Hassanzadeh & Anthony Nguyen

Editor information

Editors and Affiliations

MIMOS Berhad Technology Park Malaysia, 57000, Bukit Jalil, KL, Malaysia
Duc-Nghia Pham
Kyungpook National University, Sankyuk-Dong, Buk-Gu, 702-701, Daegu, Korea
Seong-Bae Park

About this paper

Cite this paper

Hassanzadeh, H., Groza, T., Nguyen, A., Hunter, J. (2014). Load Balancing for Imbalanced Data Sets: Classifying Scientific Artefacts for Evidence Based Medicine. In: Pham, DN., Park, SB. (eds) PRICAI 2014: Trends in Artificial Intelligence. PRICAI 2014. Lecture Notes in Computer Science, vol 8862. Springer, Cham. https://doi.org/10.1007/978-3-319-13560-1_84

DOI: https://doi.org/10.1007/978-3-319-13560-1_84
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13559-5
Online ISBN: 978-3-319-13560-1
eBook Packages: Computer Science, Computer Science (R0)
