A Comparative Study: Application of Principal Component Analysis and Recursive Feature Elimination in Machine Learning for Stroke Prediction

Arya Syifa Hermiati; Rudy Herteno; Fatma Indriani; Triando Hamonangan Saragih; Muliadi; Triwiyanto Triwiyanto

doi:10.35882/jeeemi.v6i3.446

Arya Syifa Hermiati Department of Computer Science, Lambung Mangkurat University, Banjarbaru, Indonesia https://orcid.org/0009-0004-9168-6809
Rudy Herteno Department of Computer Science, Lambung Mangkurat University, Banjarbaru, Indonesia https://orcid.org/0000-0003-0637-8090
Fatma Indriani Department of Computer Science, Lambung Mangkurat University, Banjarbaru, Indonesia https://orcid.org/0009-0006-7180-6708
Triando Hamonangan Saragih Department of Computer Science, Lambung Mangkurat University, Banjarbaru, Indonesia https://orcid.org/0000-0003-4346-3323
Muliadi Department of Computer Science, Lambung Mangkurat University, Banjarbaru, Indonesia https://orcid.org/0000-0003-2871-9482
Triwiyanto Triwiyanto Department of Medical Electronics Technology, Poltekkes Kemenkes Surabaya, Indonesia https://orcid.org/0000-0003-3179-8900

Keywords: Recursive Feature Elimination, Principal Component Analysis, Support Vector Machine, Random Forest, Naive Bayes, Linear Discriminant Analysis

Abstract

Stroke is a disease of the brain that can cause both focal and global brain dysfunction. Stroke research mainly aims to predict risk and mortality. Machine learning can be used to diagnose and predict diseases in healthcare, including stroke. However, medical record data collected for disease prediction are usually noisy, because not all variables are important or relevant to the prediction task. Dimensionality reduction is therefore essential for removing noisy (i.e., irrelevant) and redundant features. This study predicts stroke using Recursive Feature Elimination (RFE) for feature selection, Principal Component Analysis (PCA) for feature extraction, and a combination of the two. The dataset is the stroke prediction dataset from Kaggle. The methodology consists of pre-processing, SMOTE, 10-fold cross-validation, feature selection, feature extraction, and classification with four machine learning models: Support Vector Machine (SVM), Random Forest, Naive Bayes, and Linear Discriminant Analysis (LDA). SVM and Random Forest achieve their highest accuracies, 0.8775 and 0.9511 respectively, without PCA or RFE. Naive Bayes achieves its highest accuracy, 0.7685, when PCA with 20 features is followed by RFE selecting 5 features. LDA achieves its highest accuracy, 0.7963, when feature selection with 20 features is followed by feature extraction. In conclusion, SVM and Random Forest perform best without PCA and RFE, while Naive Bayes and LDA perform better with a combination of the two techniques. The implication of this research is a clearer understanding of how RFE and PCA affect machine learning models for stroke prediction.
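The two combined configurations named in the abstract (PCA followed by RFE for Naive Bayes, and RFE followed by PCA for LDA) can be illustrated with a minimal scikit-learn sketch. This is not the authors' code. The synthetic 25-feature dataset, the logistic regression estimator inside RFE, and the 10 PCA components in the second pipeline are assumptions chosen only for illustration, since the abstract does not state them. SMOTE is omitted so that the sketch depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Kaggle stroke data (assumption: 25 numeric features).
X, y = make_classification(n_samples=600, n_features=25, n_informative=8,
                           random_state=0)


def make_rfe(n_features):
    # RFE needs an estimator exposing coef_ or feature_importances_;
    # logistic regression is an assumption, not the paper's stated choice.
    return RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=n_features)


pipelines = {
    # Feature extraction first, then feature selection.
    "NB: PCA(20) -> RFE(5)": Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=20)),
        ("rfe", make_rfe(5)),
        ("clf", GaussianNB()),
    ]),
    # Feature selection first, then feature extraction
    # (10 components is an arbitrary illustrative value).
    "LDA: RFE(20) -> PCA(10)": Pipeline([
        ("scale", StandardScaler()),
        ("rfe", make_rfe(20)),
        ("pca", PCA(n_components=10)),
        ("clf", LinearDiscriminantAnalysis()),
    ]),
}

# 10-fold cross-validation; each pipeline is refit inside every fold,
# so scaling, PCA, and RFE never see the held-out fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    for name, pipe in pipelines.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy = {fold_scores.mean():.4f}")
```

Wrapping each step in a `Pipeline` keeps the reduction steps inside the cross-validation loop. Accuracies from this synthetic data are not comparable to the values reported in the abstract.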
