Application Of SMOTE To Address Class Imbalance In Diabetes Disease Classification Utilizing C5.0, Random Forest, And SVM

M. Khairul Rezki; Muhammad Itqan Mazdadi; Fatma Indriani; Muliadi Muliadi; Triando Hamonangan Saragih; Vijay Annant Athavale

doi:10.35882/jeeemi.v6i4.434

M. Khairul Rezki, Faculty of Mathematics and Natural Sciences, Lambung Mangkurat University, Banjarmasin, Indonesia, https://orcid.org/0009-0008-8979-0120
Muhammad Itqan Mazdadi, Faculty of Mathematics and Natural Sciences, Lambung Mangkurat University, Banjarmasin, Indonesia, https://orcid.org/0000-0002-8710-4616
Fatma Indriani, Faculty of Mathematics and Natural Sciences, Lambung Mangkurat University, Banjarmasin, Indonesia, https://orcid.org/0009-0006-7180-6708
Muliadi Muliadi, Faculty of Mathematics and Natural Sciences, Lambung Mangkurat University, Banjarmasin, Indonesia, https://orcid.org/0000-0003-2871-9482
Triando Hamonangan Saragih, Faculty of Mathematics and Natural Sciences, Lambung Mangkurat University, Banjarmasin, Indonesia, https://orcid.org/0000-0003-4346-3323
Vijay Annant Athavale, Walchand Institute of Technology, Solapur, India, https://orcid.org/0000-0002-6812-5198

Abstract

Applying SMOTE to address class imbalance in classification frequently yields suboptimal results, owing to the complexity of the dataset and the large number of attributes involved. Alternative classification models were therefore explored experimentally to gauge their precision. This research compares the precision of the C5.0, Random Forest, and SVM classification models with and without SMOTE. The methodology covers dataset selection, an overview of the classification algorithms (C5.0, Random Forest, SVM), the SMOTE technique, split validation, preprocessing with min-max normalization, and performance evaluation using confusion matrices and AUC analysis. The diabetes dataset was sourced from Kaggle and consists of 768 instances, with 268 diabetic and 500 non-diabetic samples; SMOTE was applied to correct this class imbalance. Before SMOTE, the classification precision of C5.0, Random Forest, and SVM was 0.714, 0.733, and 0.746 respectively, with corresponding AUC values of 0.745, 0.824, and 0.799. After SMOTE, the precision values for the same models were 0.603, 0.727, and 0.727, with AUC values of 0.734, 0.831, and 0.794 respectively. SMOTE therefore brought little benefit across the three models: precision fell for all three, and AUC rose only marginally for Random Forest (0.824 to 0.831) while declining for C5.0 and SVM. This is attributed to potential overfitting, in which the models rely excessively on synthesized minority-class data, diminishing model performance, precision, and AUC scores.
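The two preprocessing steps named in the abstract, min-max normalization and SMOTE, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two-feature toy data, the random choice of base sample, and the neighbour count `k=5` are all assumptions made for illustration. Only the class counts (268 diabetic, 500 non-diabetic) come from the abstract. The sketch uses the Python standard library only, whereas a real study would more likely rely on an established library.

```python
import random


def min_max_normalize(rows):
    """Rescale each column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    (Simplified: base samples are drawn at random, an assumption.)"""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Hypothetical toy data: 768 rows with two made-up numeric features.
data_rng = random.Random(42)
rows = [[data_rng.uniform(0, 200), data_rng.uniform(15, 70)] for _ in range(768)]
labels = [1] * 268 + [0] * 500  # class counts taken from the abstract

normalized = min_max_normalize(rows)
minority = [r for r, y in zip(normalized, labels) if y == 1]

# Number of synthetic minority samples needed to balance the classes.
needed = labels.count(0) - labels.count(1)
synthetic = smote_like_oversample(minority, needed)
print(needed, len(minority) + len(synthetic))
```

Because every synthetic point lies on a line segment between two existing minority samples, oversampling adds no information beyond the original 268 diabetic cases, which is consistent with the abstract's concern about over-reliance on synthesized data.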
