Diagnosis system for imbalanced multi-minority medical dataset

  • Methodologies and Application
Soft Computing

Abstract

Medical datasets inherently suffer from class imbalance: some sub-pathologies occur far less often than others. In this work, a disease diagnosis system for multiclass classification is developed. A hybrid synthetic sampling technique is used for extremely imbalanced datasets, and a cluster-based self-class algorithm is proposed. Compared with the near-miss algorithm, it exhibits equivalent performance with reduced sampling time. Classification results are compared against baseline approaches that use neither clustering nor synthetic sampling. A new technique based on a confidence measure is proposed for evaluating test samples with one-vs-one (OVO) classifiers. Combined with hybrid sampling, this technique suggests an improvement over the classical approaches currently used in disease diagnosis systems.
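The abstract does not define the confidence measure, so the following is only an illustrative sketch of the general idea of confidence-based OVO aggregation, not the authors' technique. It assumes each pairwise classifier returns a probability that a sample belongs to the first class of its pair; the function names, class labels, and scores are all hypothetical.

```python
from itertools import combinations


def hard_vote(pairwise, classes):
    """Classical OVO: each pairwise classifier casts one vote for its winner."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        p = pairwise[(a, b)]  # assumed: P(sample is a | sample is a or b)
        votes[a if p >= 0.5 else b] += 1
    # Ties are broken by the order of `classes`.
    return max(classes, key=lambda c: votes[c])


def confidence_vote(pairwise, classes):
    """Assumed variant: sum the pairwise confidences instead of counting wins."""
    totals = {c: 0.0 for c in classes}
    for a, b in combinations(classes, 2):
        p = pairwise[(a, b)]
        totals[a] += p
        totals[b] += 1.0 - p
    return max(classes, key=lambda c: totals[c]), totals


# Hypothetical pairwise outputs for a single test sample.
classes = ["healthy", "pathology_a", "pathology_b"]
scores = {
    ("healthy", "pathology_a"): 0.55,      # weak preference for healthy
    ("healthy", "pathology_b"): 0.60,      # weak preference for healthy
    ("pathology_a", "pathology_b"): 0.95,  # strong preference for pathology_a
}

hard_winner = hard_vote(scores, classes)
soft_winner, soft_totals = confidence_vote(scores, classes)
```

With these made-up scores, hard voting selects "healthy" on two narrow wins, while the summed confidences favour "pathology_a". This shows how weighting by confidence can change the decision for a minority class that loses by narrow margins.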

Author information

Corresponding author

Correspondence to Swati Shilaskar.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Human and animal rights statement

This article does not contain any studies with animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Shilaskar, S., Ghatol, A. Diagnosis system for imbalanced multi-minority medical dataset. Soft Comput 23, 4789–4799 (2019). https://doi.org/10.1007/s00500-018-3133-x

  • DOI: https://doi.org/10.1007/s00500-018-3133-x
