Analysis of Important Features in Software Defect Prediction Using Synthetic Minority Oversampling Techniques (SMOTE), Recursive Feature Elimination (RFE) and Random Forest
Abstract
Software Defect Prediction (SDP) is essential for improving software quality during testing. As software systems grow more complex, accurately predicting defects becomes increasingly challenging. One major challenge is imbalanced class distribution, where defective instances are far fewer than non-defective ones. This study addresses the imbalance with the Synthetic Minority Oversampling Technique (SMOTE). Random Forest is chosen as the classification algorithm because it handles non-linear data, resists overfitting, and reports the importance of each feature in classification. The research aims to identify important features and measure accuracy in SDP using the combined SMOTE+RFE+Random Forest technique. The dataset used is NASA MDP D", which includes 12 datasets. The study is conducted in two stages: the first applies RFE+Random Forest alone, and the second adds SMOTE before RFE and Random Forest, so that the accuracy obtained on the NASA MDP data can be compared with and without oversampling. The results show that SMOTE improves accuracy on most datasets, with the best performance on the MC1 dataset at an accuracy of 0.9998. Feature importance analysis identifies "maintenance severity" and "cyclomatic density" as the most important features for SDP modeling. The SMOTE+RFE+Random Forest technique therefore improves prediction accuracy across the datasets studied and addresses the class imbalance problem.
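The oversampling step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote`, its parameters, and the toy data are hypothetical, and it uses only the Python standard library to show the core idea of SMOTE, which is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. In practice a library implementation would normally be used, followed by RFE and Random Forest, with oversampling applied to the training split only.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (assumed for illustration): two metric columns,
# 8 non-defective modules versus 3 defective modules.
non_defective = [[float(i), float(i % 3)] for i in range(8)]
defective = [[10.0, 5.0], [11.0, 6.0], [12.0, 5.5]]

# Oversample the defective class until both classes have the same size.
new_rows = smote(defective, n_new=len(non_defective) - len(defective), k=2)
balanced_defective = defective + new_rows
print(len(non_defective), len(balanced_defective))
```

Because every synthetic row lies on a line segment between two real defective rows, the new samples stay inside the region already occupied by the minority class rather than duplicating existing rows.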
Copyright (c) 2024 Helma Ghinaya, Rudy Herteno, Mohammad Reza Faisal, Andi Farmadi, Fatma Indriani

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.