Analysis of the Impact of Data Oversampling on the Support Vector Machine Method for Stroke Disease Classification

Luh Ayu Martini; Gede Angga Pradipta; Roy Rudolf Huizen

doi:10.35882/jeeemi.v7i2.698

Luh Ayu Martini, Department of Magister Information Systems, Institut Teknologi dan Bisnis STIKOM Bali, Bali, Indonesia https://orcid.org/0009-0001-5868-2409
Gede Angga Pradipta, Department of Magister Information Systems, Institut Teknologi dan Bisnis STIKOM Bali, Bali, Indonesia https://orcid.org/0000-0002-6087-619X
Roy Rudolf Huizen, Department of Magister Information Systems, Institut Teknologi dan Bisnis STIKOM Bali, Bali, Indonesia https://orcid.org/0000-0002-3671-6030

Keywords: Stroke, Machine Learning, Imbalanced Data, Oversampling, Feature Selection, Support Vector Machine.

Abstract

Data imbalance is a critical challenge in the classification of medical data, particularly in predicting stroke, a life-threatening condition requiring immediate intervention. The imbalance arises because non-stroke cases far outnumber stroke cases, which can bias models toward the majority class. Consequently, a model may fail to identify stroke cases correctly, resulting in lower recall and an increased risk of misdiagnosis. This study evaluates the impact on model performance of several oversampling techniques, namely Synthetic Minority Over-sampling Technique (SMOTE), Borderline-SMOTE, SMOTE-Edited Nearest Neighbor (SMOTE-ENN), and SMOTE-Instance Prototypes Filtering (SMOTE-IPF), together with feature selection using Information Gain and Chi-Square. Oversampling addresses class imbalance by generating synthetic samples, thereby improving the representation of the minority class. Feature selection eliminates irrelevant or redundant features, enhancing both interpretability and computational efficiency. The dataset, obtained from Kaggle, consists of 5,110 records and 12 features. Support Vector Machine (SVM) is used as the classification algorithm, with evaluations conducted on Linear, Radial Basis Function (RBF), and Polynomial kernels. Experimental results indicate that the highest performance is achieved by the combination of Borderline-SMOTE and the RBF kernel, yielding an accuracy of 96.86%, precision of 98.65%, recall of 94.99%, and an F1-score of 96.79%. This configuration outperforms the others evaluated for stroke classification, demonstrating that oversampling can effectively improve prediction performance. Future research could implement deep learning-based models to further optimize stroke classification on imbalanced data. Such advancements are expected to enhance model performance, leading to a more effective and efficient approach for medical datasets.
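To make the idea of "generating synthetic samples" concrete, the following is a minimal sketch of the interpolation step at the core of SMOTE, written with the Python standard library only. It is not the authors' implementation and omits the borderline selection, ENN, and IPF cleaning steps of the other variants. The function names `smote_sketch` and `f1_score`, the parameter defaults, and the toy points are all assumptions made for illustration. The second function only checks that the reported F1-score is the harmonic mean of the reported precision and recall.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Illustrative only; names and defaults are hypothetical."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy minority-class points (made-up values, not taken from the dataset)
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]
new_points = smote_sketch(minority_points, n_new=6, k=2)

# Consistency check on the figures reported in the abstract
reported_f1 = round(100 * f1_score(0.9865, 0.9499), 2)
```

Each synthetic point lies on the line segment between two existing minority points, so the minority class grows without exact duplication. In practice a maintained library such as imbalanced-learn, which provides `SMOTE`, `BorderlineSMOTE`, and `SMOTEENN` classes, would normally be preferred over a hand-written routine; whether the authors used it is not stated here.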
