Impact of a Synthetic Data Vault for Imbalanced Class in Cross-Project Defect Prediction

Keywords: Class Imbalance, Cross Project Defect Prediction, Machine Learning, Software Defect Prediction, Synthetic Data Vault


Software Defect Prediction (SDP) is crucial for ensuring software quality. However, class imbalance (CI) poses a significant challenge in predictive modeling. This study examines the effectiveness of the Synthetic Data Vault (SDV) in mitigating CI within Cross-Project Defect Prediction (CPDP). The study addresses CI across the ReLink, MDP, and PROMISE datasets by using SDV to augment the minority class. Classification is performed with Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN), Naive Bayes (NB), and Random Forest (RF), and model performance is evaluated using AUC and a t-test. The results consistently show that SDV performs better than SMOTE and other techniques across projects, with statistically significant improvements. KNN achieves the highest average AUC, with values of 0.695, 0.704, and 0.750. On ReLink, KNN shows a 16.06% improvement over the imbalanced data and 12.84% over SMOTE. Similarly, on MDP, KNN shows a 20.71% improvement over the imbalanced data and 10.16% over SMOTE. On PROMISE, KNN shows a 13.55% improvement over the imbalanced data and 7.01% over SMOTE. RF displays moderate performance, closely followed by LR and DT, while NB lags behind. The statistical significance of these findings is confirmed by the t-test, with all p-values below the 0.05 threshold. These findings underscore SDV's potential for enhancing CPDP outcomes and tackling CI challenges in SDP, with KNN as the best-performing classification algorithm. SDV could therefore prove to be a promising tool for enhancing defect detection and CI mitigation.
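The evaluation loop described above (augment the minority class of a source project, train a classifier, score a different target project by AUC, then compare with a paired t-test) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are randomly generated rather than taken from ReLink, MDP, or PROMISE, the helper names `oversample_minority` and `cpdp_auc` are hypothetical, and the oversampler is a plain multivariate Gaussian fitted to the minority rows, used here only as a simple stand-in for an SDV synthesizer. It is not the authors' implementation.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier


def oversample_minority(X, y, rng):
    """Balance the classes with synthetic minority rows (class 1 assumed minority).

    Stand-in for an SDV synthesizer: rows are drawn from a multivariate
    Gaussian fitted to the existing minority rows.
    """
    n_new = int((y == 0).sum() - (y == 1).sum())
    if n_new <= 0:
        return X, y
    X_min = X[y == 1]
    X_new = rng.multivariate_normal(
        X_min.mean(axis=0), np.cov(X_min, rowvar=False), size=n_new
    )
    y_new = np.ones(n_new, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


def cpdp_auc(projects, balance, seed=0):
    """Train on each source project, test on every other project, return AUCs."""
    rng = np.random.default_rng(seed)
    aucs = []
    for i, (X_src, y_src) in enumerate(projects):
        for j, (X_tgt, y_tgt) in enumerate(projects):
            if i == j:
                continue  # cross-project only: source and target must differ
            if balance:
                X_fit, y_fit = oversample_minority(X_src, y_src, rng)
            else:
                X_fit, y_fit = X_src, y_src
            model = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)
            scores = model.predict_proba(X_tgt)[:, 1]
            aucs.append(roc_auc_score(y_tgt, scores))
    return np.array(aucs)


# Hypothetical data: one imbalanced table split into three "projects".
X, y = make_classification(
    n_samples=1200, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
projects = [(X[k::3], y[k::3]) for k in range(3)]

auc_imbalanced = cpdp_auc(projects, balance=False)
auc_balanced = cpdp_auc(projects, balance=True)

# Paired t-test over the same source/target pairs.
result = ttest_rel(auc_balanced, auc_imbalanced)
print("mean AUC (imbalanced):", auc_imbalanced.mean())
print("mean AUC (balanced):  ", auc_balanced.mean())
print("paired t-test p-value:", result.pvalue)
```

The numbers this sketch prints depend entirely on the generated data and say nothing about the study's reported results. To follow the paper's setup more closely, the Gaussian stand-in would be replaced by an SDV synthesizer fitted on the minority rows, and KNN by each of the five classifiers in turn.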




How to Cite
Putri Nabella, Rudy Herteno, Setyo Wahyu Saputro, Mohammad Reza Faisal, and Friska Abadi, “Impact of a Synthetic Data Vault for Imbalanced Class in Cross-Project Defect Prediction”,, vol. 6, no. 2, pp. 219-230, Apr. 2024.
Research Paper