Optimization of Backward Elimination for Software Defect Prediction with Correlation Coefficient Filter Method
Abstract
Detecting software defects is a crucial step in software development, not only to reduce cost and save time but also to mitigate more costly losses. Backward Elimination is one method used for detecting software defects. However, Backward Elimination may remove features that would later prove significant to the outcome, which degrades its performance. The aim of this study is to improve the performance of Backward Elimination. In this study, a subset of features was first selected based on their correlation coefficient with the target, and the selected features were then used to improve the performance of the final Backward Elimination model. The final model was validated using cross validation with Naïve Bayes as the classification method on the NASA MDP dataset to determine its accuracy and Area Under the Curve (AUC). Using the top 10 correlation features with Backward Elimination achieved an average accuracy of 86.6% and an AUC of 0.797, while using the top 20 correlation features with Backward Elimination achieved an average accuracy of 84% and an AUC of 0.812. Compared to using Backward Elimination alone and Naïve Bayes alone, respectively, the improvements using the top 10 correlation features were AUC: 1.52% and 13.53%, and accuracy: 13% and 12.4%, while the improvements using the top 20 correlation features were AUC: 3.43% and 15.66%, and accuracy: 10.4% and 9.8%. The results show that selecting the top 10 or top 20 features by correlation before applying Backward Elimination performs better than using Backward Elimination alone. This indicates that combining Backward Elimination with correlation coefficient feature selection improves the final Backward Elimination model and yields good results for detecting software defects.
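The pipeline described in the abstract (correlation-based pre-filtering, Backward Elimination, and cross-validated Naïve Bayes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the file name nasa_mdp_cm1.csv, the defects label column, ranking by absolute Pearson correlation, the AUC-driven stopping rule for Backward Elimination, and the use of 10-fold cross validation are all assumptions made to produce a runnable example.

```python
# Minimal sketch of the described pipeline. File name, label column, the
# exact Backward Elimination criterion, and cv=10 are assumptions, not the
# paper's confirmed settings.
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score


def top_k_by_correlation(X: pd.DataFrame, y: pd.Series, k: int) -> list:
    """Rank features by |Pearson correlation| with the label and keep the top k."""
    corr = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    return corr.sort_values(ascending=False).head(k).index.tolist()


def backward_elimination(X: pd.DataFrame, y: pd.Series, cv: int = 10) -> list:
    """Greedy backward elimination: drop one feature per pass while mean CV AUC improves."""
    features = list(X.columns)
    best_auc = cross_val_score(GaussianNB(), X[features], y,
                               cv=cv, scoring="roc_auc").mean()
    while len(features) > 1:
        # Score every candidate set obtained by dropping exactly one feature.
        scores = {
            f: cross_val_score(GaussianNB(),
                               X[[c for c in features if c != f]],
                               y, cv=cv, scoring="roc_auc").mean()
            for f in features
        }
        feature_to_drop, auc_without = max(scores.items(), key=lambda kv: kv[1])
        if auc_without <= best_auc:
            break  # no single removal improves the model; stop
        best_auc = auc_without
        features.remove(feature_to_drop)
    return features


if __name__ == "__main__":
    df = pd.read_csv("nasa_mdp_cm1.csv")          # hypothetical file name
    y = df["defects"].astype(int)                 # hypothetical label column
    X = df.drop(columns=["defects"])

    selected = top_k_by_correlation(X, y, k=10)   # k = 10 or 20 as in the study
    final = backward_elimination(X[selected], y)

    acc = cross_val_score(GaussianNB(), X[final], y, cv=10, scoring="accuracy").mean()
    auc = cross_val_score(GaussianNB(), X[final], y, cv=10, scoring="roc_auc").mean()
    print(f"Selected features: {final}")
    print(f"Accuracy: {acc:.3f}  AUC: {auc:.3f}")
```

In this sketch the correlation filter cheaply narrows the search space before the wrapper step, so Backward Elimination evaluates far fewer candidate subsets and is less likely to discard a feature that the filter has already judged strongly related to the defect label.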
Copyright (c) 2024 Muhammad Noor, Radityo Adi Nugroho, Setyo Wahyu Saputro, Rudy Herteno, Friska Abadi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).