Optimizing Software Defect Prediction Models: Integrating Hybrid Grey Wolf and Particle Swarm Optimization for Enhanced Feature Selection with Popular Gradient Boosting Algorithm

Keywords: Hybrid Grey Wolf Optimizer, Particle Swarm Optimization, Software Defect Prediction, Machine Learning, Boosting Algorithm

Abstract

Software defects, also referred to as software bugs, are anomalies or flaws in a computer program that cause the software to behave unexpectedly or produce incorrect results. These defects can take many forms, including coding errors, design flaws, and logic mistakes, and they can emerge at any stage of the software development lifecycle. Traditional prediction models often deliver low prediction performance. To address this issue, this paper proposes a novel prediction model using a Hybrid Grey Wolf Optimizer and Particle Swarm Optimization (HGWOPSO). This research aims to determine whether the hybrid model can improve the effectiveness of software defect prediction compared to the base PSO and GWO algorithms without hybridization, and to assess the effectiveness of different gradient boosting classification algorithms when combined with HGWOPSO feature selection. The study utilizes 13 NASA MDP datasets, each divided into training and testing data using 10-fold cross-validation. After the split, the SMOTE technique is applied to the training data; it generates synthetic minority samples to balance the dataset, improving the performance of the predictive model. Subsequently, feature selection is conducted using the HGWOPSO algorithm. Each subset of the NASA MDP datasets is then processed by three boosting classifiers: XGBoost, LightGBM, and CatBoost. Performance is evaluated using the area under the ROC curve (AUC). The average AUC values yielded by HGWOPSO XGBoost, HGWOPSO LightGBM, and HGWOPSO CatBoost are 0.891, 0.881, and 0.894, respectively. The results indicate that the HGWOPSO algorithm improves AUC performance compared to the base GWO and PSO algorithms; in particular, HGWOPSO CatBoost achieved the highest AUC of 0.894. This represents a 6.5% AUC increase over PSO CatBoost (significance value 0.00552) and a 6.3% increase over GWO CatBoost (significance value 0.00148). The study demonstrates that HGWOPSO significantly improves the performance of software defect prediction. Its implication is that incorporating hybrid optimization techniques for feature selection and combining them with gradient boosting algorithms can help identify and address defects more accurately.
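The abstract does not spell out the hybrid's update equations, but published PSO-GWO hybrids typically blend GWO's alpha/beta/delta leader guidance with a PSO-style inertia-weighted velocity. The sketch below is an illustrative version of that scheme only; the inertia weight `w`, population size, iteration budget, and the sphere test function are all assumptions for demonstration, not the paper's settings:

```python
import numpy as np

def hgwopso(fitness, dim, n_agents=20, iters=100, seed=0):
    """Illustrative PSO-GWO hybrid: GWO's alpha/beta/delta guidance
    combined with a PSO inertia-weighted velocity term."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, (n_agents, dim))   # agent positions
    V = np.zeros_like(X)                      # PSO-style velocities
    w = 0.7                                   # inertia weight (assumed)
    for t in range(iters):
        scores = np.apply_along_axis(fitness, 1, X)
        order = np.argsort(scores)
        alpha, beta, delta = X[order[:3]]     # three best wolves lead the pack
        a = 2 - 2 * t / iters                 # GWO coefficient decays 2 -> 0
        for i in range(n_agents):
            # Candidate positions suggested by each leader (standard GWO step)
            X1 = alpha - (2*a*rng.random(dim) - a) * np.abs(2*rng.random(dim)*alpha - X[i])
            X2 = beta  - (2*a*rng.random(dim) - a) * np.abs(2*rng.random(dim)*beta  - X[i])
            X3 = delta - (2*a*rng.random(dim) - a) * np.abs(2*rng.random(dim)*delta - X[i])
            # PSO-style velocity pulls the agent toward the leaders' candidates
            V[i] = w * (V[i] + rng.random() * (X1 - X[i])
                             + rng.random() * (X2 - X[i])
                             + rng.random() * (X3 - X[i]))
            X[i] = X[i] + V[i]
    scores = np.apply_along_axis(fitness, 1, X)
    return X[np.argmin(scores)], scores.min()

# Toy continuous objective (sphere function), purely to exercise the loop
best, val = hgwopso(lambda x: np.sum(x**2), dim=5)
```

For feature selection, each position dimension would additionally be squashed (e.g. through a sigmoid) and thresholded into a binary keep/drop mask, with classifier AUC as the fitness; that wrapper layer is omitted here for brevity.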
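Two of the pipeline's building blocks can be sketched in a few lines of NumPy: SMOTE's synthetic-sample interpolation and the rank-based AUC used for evaluation. This is an illustrative re-implementation under simplified assumptions, not the libraries the study actually used:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating each sample
    toward one of its k nearest minority-class neighbours (SMOTE, sketched)."""
    if rng is None:
        rng = np.random.default_rng(0)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]           # k nearest neighbours, excluding self
        j = rng.choice(nn)
        gap = rng.random()                    # random point on the connecting segment
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

def auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the fraction of (positive, negative) pairs ranked correctly, ties count half."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

As the abstract notes, SMOTE is applied to the training folds only; oversampling before the cross-validation split would leak synthetic copies of test samples into training and inflate the AUC.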
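The quoted significance values come from comparing two models' per-dataset AUC scores; a paired t statistic over the matched scores (here, the same 13 NASA MDP datasets under both models) is the usual construction. The helper below is a minimal sketch, and the sample numbers in the test are illustrative, not the paper's scores:

```python
import numpy as np

def paired_t(a, b):
    """Paired t statistic over matched per-dataset scores
    (e.g. AUC on the same datasets under two different models)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```

The p-value then follows from the t distribution with n-1 degrees of freedom, e.g. via `scipy.stats.ttest_rel`, which computes the same statistic directly.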



Published
2024-04-08
How to Cite
[1]
Angga Maulana Akbar, R. Herteno, S. W. Saputro, M. R. Faisal, and R. A. Nugroho, “Optimizing Software Defect Prediction Models: Integrating Hybrid Grey Wolf and Particle Swarm Optimization for Enhanced Feature Selection with Popular Gradient Boosting Algorithm”, j.electron.electromedical.eng.med.inform, vol. 6, no. 2, pp. 169-181, Apr. 2024.
Section
Research Paper