Comparison of Discriminant Analysis and Support Vector Machine on Mixed Categorical and Continuous Independent Variables for COVID-19 Patients Data

Husnul Aris Haikal(1), Aji Hamim Wigena(2), Kusman Sadik(3), Efriwati Efriwati(4)


(1) Faculty of Mathematics and Natural Sciences, Institut Pertanian Bogor, Indonesia
(2) Faculty of Mathematics and Natural Sciences, Institut Pertanian Bogor, Indonesia
(3) Faculty of Mathematics and Natural Sciences, Institut Pertanian Bogor, Indonesia
(4) National Research and Innovation Agency, Indonesia

Abstract

Purpose: Numerous factors can affect the duration of COVID-19 recovery, one of which is the use of natural herbal medication. This study seeks to identify the variables that influence COVID-19 recovery time and to compare discriminant analysis and support vector machine models using COVID-19 patient data from West Sumatra.

Methods: Two data mining methods, Discriminant Analysis and Support Vector Machine with three kernel types (linear, polynomial, and radial basis function), were used to classify COVID-19 recovery time. The study used 428 observations, with 75% allocated to training and 25% to testing. Independent variables were selected according to their information value (IV), which gauges each variable's influence on the dependent variable. To address class imbalance, three data resampling techniques were applied: undersampling, oversampling, and SMOTE. Discriminant Analysis and Support Vector Machine were then compared on balanced accuracy.
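The information value used for variable selection can be illustrated with a minimal sketch. It assumes a binary dependent variable coded 0/1 and a categorical independent variable, and uses the common definition IV = sum over levels of (p_event - p_non_event) * ln(p_event / p_non_event). The function name `information_value` and the smoothing constant `eps` are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def information_value(feature, target, eps=0.5):
    """Information value of a categorical feature against a binary target.

    Assumption: target is coded 1 (event) / 0 (non-event).
    eps is an illustrative smoothing constant, not a value from the study.
    """
    events = Counter(f for f, t in zip(feature, target) if t == 1)
    non_events = Counter(f for f, t in zip(feature, target) if t == 0)
    levels = set(events) | set(non_events)
    n_events = sum(events.values())
    n_non_events = sum(non_events.values())

    iv = 0.0
    for level in levels:
        # Smoothed share of events / non-events falling in this level;
        # smoothing avoids log(0) when a level has an empty cell.
        p_event = (events[level] + eps) / (n_events + eps * len(levels))
        p_non = (non_events[level] + eps) / (n_non_events + eps * len(levels))
        # Each term is (difference in shares) * (weight of evidence) >= 0.
        iv += (p_event - p_non) * math.log(p_event / p_non)
    return iv


# Toy data: the first feature is unrelated to the target, the second separates it.
target = [1, 0, 1, 0]
unrelated = ["a", "a", "b", "b"]
separating = ["a", "b", "a", "b"]
iv_unrelated = information_value(unrelated, target)
iv_separating = information_value(separating, target)
```

A variable whose levels are distributed identically across both classes gets an IV of zero, while a variable that separates the classes gets a larger value, so ranking variables by IV gives a simple screening rule.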

Result: The Discriminant Analysis with SMOTE achieved a balanced accuracy of 66.50%, outperforming the linear kernel Support Vector Machine with SMOTE, which had a balanced accuracy of 63.20% in this dataset.
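Balanced accuracy, the measure reported above, is the mean of sensitivity and specificity. The sketch below assumes a two-class problem with labels 0/1 and uses invented toy labels, not the study's data; the function name `balanced_accuracy` is illustrative.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity and specificity for a two-class problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Toy imbalanced labels: 2 positives, 8 negatives, classifier always predicts 0.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
```

In this toy case the plain accuracy is 0.8 even though no positive case is found, whereas the balanced accuracy is 0.5, which is why balanced accuracy is the more informative measure on imbalanced classes.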

Novelty: This study compares Discriminant Analysis and SVM on data with mixed categorical and continuous independent variables. It also examines undersampling, oversampling, and SMOTE for handling imbalanced data, with variable selection based on information value.
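The idea behind SMOTE can be sketched as follows: each synthetic minority observation is placed at a random point on the line segment between a minority observation and one of its nearest minority neighbours. This is a simplified illustration for continuous variables only, not the implementation used in the study; the function name `smote_sketch` and its parameters are hypothetical, and categorical variables would need separate handling.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority points.

    Assumptions: minority is a list of at least two numeric feature lists;
    k and seed are illustrative defaults, not values from the study.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x, excluding x itself.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority class with two continuous variables.
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.5, 2.5]]
new_points = smote_sketch(minority, n_new=5)
```

Random oversampling would instead duplicate existing minority rows, and random undersampling would drop majority rows; interpolation is what distinguishes SMOTE from both.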

Keywords

Discriminant analysis; Support vector machine; Mixed independent variable; Resampling; COVID-19




Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji

This work is licensed under a Creative Commons Attribution 4.0 International License.