Load Balancing for Imbalanced Data Sets: Classifying Scientific Artefacts for Evidence Based Medicine

  • Conference paper
PRICAI 2014: Trends in Artificial Intelligence (PRICAI 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8862)

Abstract

Data skewness is a common challenge when applying supervised machine learning approaches in domains such as healthcare and biomedical information engineering. Evidence Based Medicine (EBM) is a clinical strategy for prescribing treatment based on current best evidence for individual patients. Clinicians need to query publication repositories in order to find the best evidence to support their decision-making processes. This sophisticated information is materialised in the form of scientific artefacts in scholarly publications, and the automatic extraction of these artefacts is a technical challenge for current generic search engines. Many classification approaches have been proposed for identifying key scientific artefacts in EBM; however, their performance is affected by the imbalanced nature of the data in this domain. In this paper, we present four data balancing approaches applied in a binary ensemble classifier framework for classifying scientific artefacts in the EBM domain. Our balancing approaches improve the ensemble classifier's F-score by up to 15% for classes of scientific artefacts with extremely low coverage in the domain. In addition, we propose a classifier selection method for choosing the best classifier based on the distributional feature of the classes. The resulting classifiers show improved classification performance when compared to state-of-the-art approaches.
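The abstract does not spell out the four balancing approaches, so the following is only a generic illustration of the idea, not the authors' method. It is a minimal sketch of random under-sampling for one binary (one-vs-rest) classifier, together with the F-score used to judge it. The function names, the `ratio` parameter, the label names, and the toy data are all assumptions introduced for illustration.

```python
import random
from collections import Counter


def undersample(examples, labels, positive, ratio=1.0, seed=0):
    """Build a balanced binary training set for one target class.

    Hypothetical helper: keeps every positive example and a random
    subset of negatives, at most `ratio` times the number of positives.
    Returns a shuffled list of (example, binary_label) pairs.
    """
    rng = random.Random(seed)
    pos = [(x, 1) for x, y in zip(examples, labels) if y == positive]
    neg = [(x, 0) for x, y in zip(examples, labels) if y != positive]
    keep = min(len(neg), int(ratio * len(pos)))
    balanced = pos + rng.sample(neg, keep)
    rng.shuffle(balanced)
    return balanced


def f_score(gold, predicted):
    """F1 score (harmonic mean of precision and recall) for binary labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up data: one class covers only 10% of the sentences.
sentences = [f"sentence {i}" for i in range(100)]
labels = ["Other"] * 90 + ["RareArtefact"] * 10

balanced = undersample(sentences, labels, positive="RareArtefact")
counts = Counter(y for _, y in balanced)
print(counts[1], counts[0])
```

In a binary ensemble, one such balanced training set would be built per artefact class, so that each one-vs-rest classifier sees a less skewed distribution. Over-sampling methods such as SMOTE take the opposite route and synthesise additional minority examples instead of discarding majority ones.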

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Hassanzadeh, H., Groza, T., Nguyen, A., Hunter, J. (2014). Load Balancing for Imbalanced Data Sets: Classifying Scientific Artefacts for Evidence Based Medicine. In: Pham, DN., Park, SB. (eds) PRICAI 2014: Trends in Artificial Intelligence. PRICAI 2014. Lecture Notes in Computer Science, vol 8862. Springer, Cham. https://doi.org/10.1007/978-3-319-13560-1_84

  • DOI: https://doi.org/10.1007/978-3-319-13560-1_84

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13559-5

  • Online ISBN: 978-3-319-13560-1

  • eBook Packages: Computer Science, Computer Science (R0)
