Optimization of Backward Elimination for Software Defect Prediction with Correlation Coefficient Filter Method
Abstract
Detecting software defects is a crucial step in software development, not only to reduce cost and save time but also to mitigate more costly losses. Backward Elimination is one method used for detecting software defects. However, Backward Elimination may remove features that would later prove significant to the outcome, which degrades its performance. The aim of this study is to improve the performance of Backward Elimination. In this study, a subset of features was first selected based on their correlation coefficient with the target, and the selected features were then used to improve the performance of the final Backward Elimination model. The final model was validated using cross validation with Naïve Bayes as the classification method on the NASA MDP dataset to determine its accuracy and Area Under the Curve (AUC). Using the top 10 correlation features with Backward Elimination achieved an average accuracy of 86.6% and an AUC of 0.797, while using the top 20 correlation features with Backward Elimination achieved an average accuracy of 84% and an AUC of 0.812. Compared to using Backward Elimination alone and Naïve Bayes alone, respectively, the improvements using the top 10 correlation features were AUC: 1.52% and 13.53%, and accuracy: 13% and 12.4%, while the improvements using the top 20 correlation features were AUC: 3.43% and 15.66%, and accuracy: 10.4% and 9.8%. The results show that selecting the top 10 or top 20 features by correlation before applying Backward Elimination performs better than using Backward Elimination alone. This indicates that combining Backward Elimination with correlation coefficient feature selection improves the final Backward Elimination model and yields good results for detecting software defects.
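The pipeline described in the abstract (correlation-based pre-filtering, Backward Elimination, and cross-validated Naïve Bayes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the file name nasa_mdp_cm1.csv, the defects label column, ranking by absolute Pearson correlation, the AUC-driven stopping rule for Backward Elimination, and the use of 10-fold cross validation are all assumptions made to produce a runnable example.

```python
# Minimal sketch of the described pipeline. File name, label column, the
# exact Backward Elimination criterion, and cv=10 are assumptions, not the
# paper's confirmed settings.
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score


def top_k_by_correlation(X: pd.DataFrame, y: pd.Series, k: int) -> list:
    """Rank features by |Pearson correlation| with the label and keep the top k."""
    corr = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    return corr.sort_values(ascending=False).head(k).index.tolist()


def backward_elimination(X: pd.DataFrame, y: pd.Series, cv: int = 10) -> list:
    """Greedy backward elimination: drop one feature per pass while mean CV AUC improves."""
    features = list(X.columns)
    best_auc = cross_val_score(GaussianNB(), X[features], y,
                               cv=cv, scoring="roc_auc").mean()
    while len(features) > 1:
        # Score every candidate set obtained by dropping exactly one feature.
        scores = {
            f: cross_val_score(GaussianNB(),
                               X[[c for c in features if c != f]],
                               y, cv=cv, scoring="roc_auc").mean()
            for f in features
        }
        feature_to_drop, auc_without = max(scores.items(), key=lambda kv: kv[1])
        if auc_without <= best_auc:
            break  # no single removal improves the model; stop
        best_auc = auc_without
        features.remove(feature_to_drop)
    return features


if __name__ == "__main__":
    df = pd.read_csv("nasa_mdp_cm1.csv")          # hypothetical file name
    y = df["defects"].astype(int)                 # hypothetical label column
    X = df.drop(columns=["defects"])

    selected = top_k_by_correlation(X, y, k=10)   # k = 10 or 20 as in the study
    final = backward_elimination(X[selected], y)

    acc = cross_val_score(GaussianNB(), X[final], y, cv=10, scoring="accuracy").mean()
    auc = cross_val_score(GaussianNB(), X[final], y, cv=10, scoring="roc_auc").mean()
    print(f"Selected features: {final}")
    print(f"Accuracy: {acc:.3f}  AUC: {auc:.3f}")
```

In this sketch the correlation filter cheaply narrows the search space before the wrapper step, so Backward Elimination evaluates far fewer candidate subsets and is less likely to discard a feature that the filter has already judged strongly related to the defect label.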
Copyright (c) 2024 Muhammad Noor, Radityo Adi Nugroho, Setyo Wahyu Saputro, Rudy Herteno, Friska Abadi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).