Abstract
Diabetes mellitus is one of the most critical diseases the world currently faces. Current diagnostic practice involves various tests at a laboratory or hospital, followed by treatment based on the diagnosis. This study proposes a machine learning model that classifies a patient as diabetic or non-diabetic using the popular PIMA Indian dataset. The dataset contains features such as Pregnancy, Blood Pressure, Skin Thickness, Age and Diabetes Pedigree Function, along with common factors such as Glucose, BMI and Insulin. The objective is to apply several pre-processing techniques that improve accuracy over simple models. The study compares several classification models, namely GaussianNB, Logistic Regression, KNN, Decision Tree Classifier, Random Forest Classifier and Gradient Boosting Classifier. First, missing values in the significant features are replaced by the median of each input variable, computed separately for diabetic and non-diabetic patients. Next, feature engineering adds new features obtained by categorizing existing features according to their ranges. Finally, hyperparameter tuning is carried out to optimize the model. Accuracy and area under the ROC curve (AUC) are used to validate the effectiveness of the proposed framework. Results indicate that the XGBoost classifier is the best-performing model, with 88% accuracy and an AUC of 0.948. The performance of the model is evaluated using the confusion matrix and the ROC curve.
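The pre-processing and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy records, the choice of 0 as the missing-value marker, the helper names `impute_by_outcome`, `bmi_category` and `auc`, and the BMI cut-offs (the usual WHO bands) are all assumptions made for illustration, since the abstract does not specify them.

```python
from statistics import median

# Toy records shaped like PIMA columns; the values are made up for illustration.
# Assumption: a 0 in Glucose or BMI marks a missing measurement.
rows = [
    {"Glucose": 148, "BMI": 33.6, "Outcome": 1},
    {"Glucose": 0,   "BMI": 43.1, "Outcome": 1},
    {"Glucose": 183, "BMI": 0.0,  "Outcome": 1},
    {"Glucose": 85,  "BMI": 26.6, "Outcome": 0},
    {"Glucose": 0,   "BMI": 28.1, "Outcome": 0},
    {"Glucose": 89,  "BMI": 0.0,  "Outcome": 0},
]


def impute_by_outcome(rows, feature, missing=0):
    """Fill missing values with the median of observed values in the same outcome class."""
    medians = {}
    for label in {r["Outcome"] for r in rows}:
        observed = [r[feature] for r in rows
                    if r["Outcome"] == label and r[feature] != missing]
        medians[label] = median(observed)
    for r in rows:
        if r[feature] == missing:
            r[feature] = medians[r["Outcome"]]
    return medians


def bmi_category(bmi):
    """Hypothetical range-based feature; cut-offs are WHO bands, not taken from the paper."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Step 1: class-conditional median imputation.
glucose_medians = impute_by_outcome(rows, "Glucose")
bmi_medians = impute_by_outcome(rows, "BMI")

# Step 2: add a categorical feature derived from the range of an existing one.
for r in rows:
    r["BMI_Category"] = bmi_category(r["BMI"])

# Step 3: score a set of (made-up) predicted probabilities with AUC.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Because this style of imputation reads the outcome label, the medians in a real experiment would presumably need to be computed on the training split only; the abstract does not say how this was handled.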
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
James, D.E., Vimina, E.R. (2022). Machine Learning-Based Early Diabetes Prediction. In: Raj, J.S., Palanisamy, R., Perikos, I., Shi, Y. (eds) Intelligent Sustainable Systems. Lecture Notes in Networks and Systems, vol 213. Springer, Singapore. https://doi.org/10.1007/978-981-16-2422-3_52
DOI: https://doi.org/10.1007/978-981-16-2422-3_52
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2421-6
Online ISBN: 978-981-16-2422-3
eBook Packages: Intelligent Technologies and Robotics (R0)