Comparative Study of Various Hyperparameter Tuning on Random Forest Classification With SMOTE and Feature Selection Using Genetic Algorithm in Software Defect Prediction

Keywords: Genetic Algorithm, Hyperparameter Tuning, Random Forest, Software Defect Prediction

Abstract

Software defect prediction is essential for both desktop and mobile applications. The defect prediction performance of Random Forest can be significantly improved by optimizing its parameters rather than using the defaults, yet the tuning step is commonly neglected. Because Random Forest has numerous tunable parameters, adjusting them manually is time-consuming, diminishes efficiency, and tends to yield suboptimal results. This research aims to improve the performance of Random Forest classification by using SMOTE to balance the data, a Genetic Algorithm for feature selection, and hyperparameter tuning to optimize performance, and to determine which hyperparameter tuning method produces the greatest improvement for the Random Forest classifier. The study uses the NASA MDP corpus, which comprises 13 datasets. The method combines SMOTE to handle imbalanced data, Genetic Algorithm feature selection, Random Forest classification, and hyperparameter tuning with Grid Search, Random Search, Optuna, Bayesian optimization (via Hyperopt), Hyperband, TPE, and Nevergrad. Performance was evaluated using accuracy and AUC. In terms of accuracy improvement, the three best methods are Nevergrad, TPE, and Hyperband; in terms of AUC improvement, the three best methods are Hyperband, Optuna, and Random Search. On average, Nevergrad improves accuracy by about 3.9% and Hyperband improves AUC by about 3.51%. The study indicates that hyperparameter tuning improves Random Forest performance and that, among all the methods evaluated, Hyperband performs best overall, with the highest average increase across both accuracy and AUC. The implication of this research is to encourage the use of hyperparameter tuning in software defect prediction and thereby improve prediction performance.
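
For readers who want a concrete picture of the tuning stage, the sketch below shows one configuration under stated assumptions: Python with scikit-learn, imbalanced-learn, and Optuna (whose default sampler is TPE, so the same code doubles as a TPE example). The synthetic dataset is only a stand-in for the NASA MDP datasets, and the Genetic Algorithm feature-selection step is omitted for brevity; the other optimizers compared in the paper (Grid Search, Random Search, Hyperopt, Hyperband, Nevergrad) would wrap the same cross-validated objective.

```python
# Minimal sketch: SMOTE-balanced Random Forest tuned with Optuna,
# optimizing cross-validated AUC (one of the paper's two metrics).
import optuna
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced stand-in for a NASA MDP dataset (defective class is the minority).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# Balance the classes with SMOTE. (In practice SMOTE is better applied
# inside each CV fold, e.g. via imblearn.pipeline, to avoid leakage.)
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

def objective(trial):
    # Illustrative search space over common Random Forest hyperparameters.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 32),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
    }
    clf = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
    return cross_val_score(clf, X_res, y_res, cv=5, scoring="roc_auc").mean()

# Baseline: default-parameter Random Forest, for the before/after comparison.
baseline = cross_val_score(RandomForestClassifier(random_state=42),
                           X_res, y_res, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print(f"default-parameter AUC: {baseline:.4f}")
print(f"tuned AUC: {study.best_value:.4f} with {study.best_params}")
```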



Published
2024-03-22
How to Cite
[1] M. K. Suryadi, R. Herteno, S. W. Saputro, M. R. Faisal, and R. A. Nugroho, “Comparative Study of Various Hyperparameter Tuning on Random Forest Classification With SMOTE and Feature Selection Using Genetic Algorithm in Software Defect Prediction”, j.electron.electromedical.eng.med.inform, vol. 6, no. 2, pp. 137-147, Mar. 2024.
Section
Research Paper