Refining Diabetes Diagnosis Models: The Impact of SMOTE on SVM, Logistic Regression, and Naïve Bayes

Keywords: Diabetes, Logistic Regression, Naïve Bayes, SMOTE, Support Vector Machine

Abstract

Accurate diabetes classification remains a significant challenge in medical diagnostics, especially when datasets are imbalanced. This study addresses the problem by introducing A New Modified Weighted SMOTE (ANMWS), integrated with the Priority of Attribute by Expert Judgement (PAEJ) framework, to improve the performance of machine learning models trained on imbalanced data. PAEJ categorizes attributes into three priority levels (high, medium, and low) based on expert knowledge, while ANMWS applies weighted oversampling using these priority levels to generate synthetic data that is more representative of real-world cases. The proposed method was evaluated using three algorithms: Support Vector Machine (SVM), Logistic Regression, and Naïve Bayes. Results indicate that applying the ANMWS algorithm with the PAEJ framework significantly improved predictive performance, with AUC values increasing to 0.995 for SVM, 0.993 for Logistic Regression, and 0.990 for Naïve Bayes, compared with 0.980, 0.978, and 0.975, respectively, under standard SMOTE. Additionally, precision and recall for SVM improved by 5% and 7%, respectively. These findings demonstrate the critical role of the ANMWS algorithm and the PAEJ framework in addressing class imbalance, providing a reliable method for early diabetes diagnosis and informed clinical decision-making.
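
The abstract does not specify how ANMWS turns priority levels into oversampling weights, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that each priority level maps to a numeric weight (the values in `PRIORITY_WEIGHT` are hypothetical) and that these weights scale the distance used to pick SMOTE neighbours; the function names and the toy samples are likewise illustrative.

```python
import math
import random

# Hypothetical mapping from PAEJ priority level to a numeric weight.
# The abstract does not give the actual values used in the paper.
PRIORITY_WEIGHT = {"high": 3.0, "medium": 2.0, "low": 1.0}


def weighted_distance(a, b, weights):
    """Euclidean distance with each squared difference scaled by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def priority_weighted_smote(minority, priorities, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by SMOTE-style interpolation.

    Neighbours are ranked with a priority-weighted distance, so attributes
    marked as high priority have more influence on which samples are "close".
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    weights = [PRIORITY_WEIGHT[p] for p in priorities]
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the remaining minority samples by weighted distance to the base.
        others = [s for j, s in enumerate(minority) if j != i]
        others.sort(key=lambda s: weighted_distance(base, s, weights))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([x + gap * (y - x) for x, y in zip(base, neighbour)])
    return synthetic


# Toy minority-class rows: [glucose, BMI, skin thickness]. Illustrative values only.
minority = [
    [148.0, 33.6, 35.0],
    [183.0, 23.3, 0.0],
    [137.0, 43.1, 35.0],
    [168.0, 38.0, 42.0],
]
priorities = ["high", "medium", "low"]
new_samples = priority_weighted_smote(minority, priorities, n_synthetic=5)
```

Because every synthetic row is a convex combination of two real minority rows, each generated value stays within the observed range of its attribute. In this sketch the priority weights change only which neighbours are selected; other designs, such as weighting how many synthetic samples each seed produces, are equally consistent with the abstract.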

Published
2025-01-11
How to Cite
[1]
A. Wibowo, A. F. N. Masruriyah, and S. Rahmawati, “Refining Diabetes Diagnosis Models: The Impact of SMOTE on SVM, Logistic Regression, and Naïve Bayes”, j.electron.electromedical.eng.med.inform, vol. 7, no. 1, pp. 197-207, Jan. 2025.
Section
Research Paper