Application Of SMOTE To Address Class Imbalance In Diabetes Disease Classification Utilizing C5.0, Random Forest, And SVM
Abstract
Applying SMOTE to address class imbalance in classification often yields suboptimal results because of the complexity of the dataset and the large number of attributes involved. Several classification models were therefore tested experimentally to gauge their precision. This research compares the precision of the C5.0, Random Forest, and SVM classification models with and without SMOTE. The methodology covers dataset selection, an overview of the classification algorithms (C5.0, Random Forest, SVM), the SMOTE technique, split validation, preprocessing with min-max normalization, and performance evaluation using confusion matrices and AUC analysis. The diabetes dataset, sourced from Kaggle, consists of 768 instances: 268 diabetic cases and 500 non-diabetic cases. Before SMOTE was applied, the classification precision of C5.0, Random Forest, and SVM was 0.714, 0.733, and 0.746 respectively, with corresponding AUC values of 0.745, 0.824, and 0.799. After SMOTE, the precision values for the same models were 0.603, 0.727, and 0.727, with AUC values of 0.734, 0.831, and 0.794 respectively. These results suggest that SMOTE had minimal impact across the three classification models, possibly because of overfitting: the models relied excessively on synthesized minority-class data, which diminished model performance, precision, and AUC scores.
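The abstract does not show the authors' implementation, so the following is only a minimal sketch of the two preprocessing steps it names, min-max normalization and SMOTE, assuming the standard SMOTE formulation (interpolating between a minority sample and one of its k nearest minority neighbours). The function names `min_max_normalize` and `smote`, the parameter defaults, and the toy rows are hypothetical; only the class counts (268 and 500) come from the abstract.

```python
import math
import random


def min_max_normalize(rows):
    """Rescale every column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (Euclidean).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Class counts reported in the abstract.
n_diabetic, n_non_diabetic = 268, 500
# Synthetic diabetic samples needed to fully balance the two classes.
n_needed = n_non_diabetic - n_diabetic

# Toy, made-up minority rows (NOT taken from the diabetes dataset).
toy_minority = min_max_normalize(
    [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0], [4.0, 40.0]]
)
toy_synthetic = smote(toy_minority, n_synthetic=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, oversampling adds no information beyond the original 268 diabetic cases, which is consistent with the overfitting concern raised in the abstract. In practice, oversampling is usually applied to the training split only, so that synthetic points do not leak into the validation data; whether the study did so is not stated here.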
Copyright (c) 2024 M. Khairul Rezki, Muhammad Itqan Mazdadi, Fatma Indriani, Muliadi, Triando Hamonangan Saragih, Vijay Annant Athavale
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.