A Comparative Analysis of Polynomial-fit-SMOTE Variations with Tree-Based Classifiers on Software Defect Prediction
Abstract
Software defects undermine the reliability of software systems and often cause substantial economic losses. This study examines the effectiveness of polynomial-fit SMOTE (pf-SMOTE) variants combined with tree-based classifiers for software defect prediction, using the NASA Metrics Data Program (MDP) datasets. The methodology partitions each dataset into training and test subsets, applies pf-SMOTE oversampling, and evaluates classification performance with Decision Tree, Random Forest, and Extra Trees classifiers. Across the 12 NASA MDP datasets, the combination of pf-SMOTE-star oversampling and the Extra Trees classifier achieves the highest average accuracy (90.91%) and AUC (95.67%), demonstrating the potential of pf-SMOTE variants to improve classification effectiveness, although caution is warranted regarding biases that synthetic data can introduce. These results improve on previous work and underscore the importance of careful algorithm selection and attention to dataset characteristics, with practical implications for software reliability and for decision support in software project management. Future research could examine combinations of pf-SMOTE variants with other classification methods and incorporate hyperparameter tuning to further refine classification performance.
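The pipeline described above (partition, oversample the minority class, train a tree ensemble) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `pf_smote_star` helper is a simplified rendering of the star topology (synthetic points interpolated between each minority sample and the minority-class centroid), the toy data stands in for a NASA MDP module-metrics table, and all function and variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

def pf_smote_star(X_min, n_new, rng):
    """Simplified pf-SMOTE star topology: synthesize points on the
    segments joining each minority sample to the minority centroid."""
    centroid = X_min.mean(axis=0)
    idx = rng.integers(0, len(X_min), size=n_new)
    t = rng.random((n_new, 1))  # interpolation factor in [0, 1]
    return X_min[idx] + t * (centroid - X_min[idx])

rng = np.random.default_rng(0)
# Toy imbalanced data: 180 non-defective vs 20 defective modules
X = np.vstack([rng.normal(0.0, 1.0, size=(180, 5)),
               rng.normal(1.5, 1.0, size=(20, 5))])
y = np.array([0] * 180 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training split, so no synthetic information leaks
# into the held-out test set
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_syn = pf_smote_star(X_tr[y_tr == 1], n_new, rng)
X_bal = np.vstack([X_tr, X_syn])
y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])

clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
print(clf.score(X_te, y_te))
```

In practice the `smote_variants` Python package cited in the article's references implements the full family of pf-SMOTE topologies (star, bus, polynomial, mesh) and would replace the hand-rolled helper here.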
Copyright (c) 2024 Wildan Nur Hidayatullah, Rudy Herteno, Mohammad Reza Faisal, Radityo Adi Nugroho, Setyo Wahyu Saputro, Zarif Bin Akhtar

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.