Impact of a Synthetic Data Vault for Imbalanced Class in Cross-Project Defect Prediction

Software Defect Prediction (SDP) is crucial for ensuring software quality. However, class imbalance (CI) poses a significant challenge in predictive modeling. This study introduces a novel approach that employs the Synthetic Data Vault (SDV) to tackle CI within Cross-Project Defect Prediction (CPDP). The study addresses CI across multiple datasets (ReLink, MDP, and PROMISE) by using SDV to augment the minority class. Classification is performed with Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN), Naive Bayes (NB), and Random Forest (RF), and model performance is evaluated using AUC and the t-Test. The results consistently show that SDV performs better than SMOTE and other techniques across projects, with statistically significant improvements. KNN achieves the highest average AUC, with values of 0.695, 0.704, and 0.750. On ReLink, KNN shows a 16.06% improvement over the imbalanced data and 12.84% over SMOTE. Similarly, on MDP, KNN shows a 20.71% improvement over the imbalanced data and 10.16% over SMOTE. On PROMISE, KNN shows a 13.55% improvement over the imbalanced data and 7.01% over SMOTE. RF displays moderate performance, closely followed by LR and DT, while NB lags behind. Overall, SDV achieves an improvement of 10.10% over the imbalanced data and 7.54% over SMOTE. The statistical significance of these findings is confirmed by the t-Test, with all results below the 0.05 threshold. In practice, adopting SDV for defect detection and CI mitigation is effective, particularly with KNN as the best-performing classification algorithm, and shows promising potential to enhance software quality by addressing CI and improving predictive modeling outcomes.


I. INTRODUCTION
Software development has evolved significantly, marked by increasing complexity in coding and implementation processes, which demands meticulous attention to ensure defect-free outcomes [1]. Despite substantial advancements in software engineering, challenges persist, particularly in the identification and rectification of software defects, which are vital for businesses to mitigate unforeseeable financial losses [2], [3]. To address these challenges, preemptive measures are essential, underscoring the importance of defect prediction methodologies in software engineering [4].
Software Defect Prediction (SDP) has emerged as a critical focus within software engineering, dedicated to systematically identifying flawed components within software projects [5]. These predictive models play a pivotal role in identifying the segments of a software system most likely to contain defects, thereby facilitating efficient allocation of testing resources [6]. Among the various SDP methodologies, Within-Project Defect Prediction (WPDP) stands out, integrating models within the broader framework of SDP [7].
However, traditional SDP approaches encounter limitations, particularly in scenarios where historical data from locally accessible projects is lacking, rendering WPDP nonviable [8], [9]. Consequently, researchers have shifted their attention towards emerging methodologies, prominently including Cross-Project Defect Prediction (CPDP) [4].
To tackle the challenge of CI in software defect prediction, numerous studies have explored over-sampling techniques, among which the Synthetic Minority Over-sampling Technique (SMOTE) has emerged as a widely adopted approach [16]. The study in [17] compares two techniques for handling imbalanced data: oversampling with SMOTE and undersampling with Random Undersampling (RUS), using Gradient Boosting (GB) and RF as classification algorithms. On the original unbalanced dataset, the AUC values were 0.635 for GB and 0.644 for RF. After applying SMOTE, the AUC values increased to 0.649 for GB and 0.667 for RF. With RUS, AUC values of 0.644 for GB and 0.650 for RF were obtained.
These findings demonstrate that employing SMOTE with both classification algorithms resulted in a significant enhancement in model performance, while the use of RUS yielded insignificant changes. Therefore, SMOTE can be considered an effective method for addressing CI in the PROMISE dataset. However, it is important to note that this study only utilizes CK metrics, incorporating a subset of six attributes out of a total of 20 available attributes. This approach was adopted to focus on revealing the relationship between defects in object-oriented projects and CK metrics.
Other research has also investigated the efficacy of SMOTE as a method for addressing CI in CPDP. In [18], SMOTE combined with AdaBoost (AD-SMOTE) was utilized to mitigate misclassification, resulting in an AUC of 0.664. Another study [19] employed SMOTE in conjunction with Deep Canonical Correlation Analysis (S-DCCA) to calculate correlations and selectively utilize subsets characterized only by features with high correlation, leading to an AUC of 0.632. SMOTE's popularity stems from its ability to enhance class balance without sacrificing valuable minority samples, showcasing its efficacy across various studies. While originally devised for classification tasks, SMOTE has also been adapted to regression problems [20]. Over the past decade, SMOTE has proven its utility across diverse domains, yielding significant contributions to various applications [21].
However, it is crucial to acknowledge that SMOTE is not without limitations. Despite its effectiveness, SMOTE may oversimplify the minority class, potentially producing instances that fail to capture the complexity of real-world data. Furthermore, the introduction of noise or bias into synthetic data can degrade the performance of defect prediction models and potentially lead to overfitting [22], [23].
In response, this study proposes the adoption of synthetic data from the Synthetic Data Vault (SDV) as an alternative approach. Synthetic data generated by data synthesizers have been shown to better represent the original data distribution, offering potential advantages over traditional methods [24], [25]. SDV features a number of approaches, each with its own advantages. Generative Adversarial Networks (GAN) have proven to be powerful, generating high-quality, diverse synthetic data that closely resembles the original dataset, and they improve model performance through data augmentation [26]. Conditional Tabular GAN (CTGAN) extends this idea by generating data conditioned on discrete values, overcoming CI and enriching the dataset with specialized information [26]. Copula GAN differentiates itself by utilizing copula functions in the generative process, offering greater interpretability and flexibility in capturing relationships between variables [27]. The Gaussian Copula is distinguished by its capacity to generate synthetic data and effectively calibrate noise, attributed to its flexibility in describing dependencies between random variables [28]. Variational Autoencoders (VAE) capture the underlying data distribution using nonparametric approaches, providing a powerful alternative in tabular data generation [29]. Overall, the SDV approach offers a diverse set of tools with specific advantages for addressing challenges in synthetic data generation across various applications.
This study assesses the efficacy of SDV techniques in mitigating CI within CPDP. It uses five frequently used classification algorithms [30], namely Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN), Naive Bayes (NB), and Random Forest (RF), with AUC as the evaluation metric. The research focuses on leveraging original samples from the minority class within CPDP datasets to create new synthetic instances. This approach directly addresses CI between the majority and minority classes, thereby enhancing the overall effectiveness and fairness of CPDP models. The contributions of this study are as follows:
a. Introduction of SDV as an alternative approach to traditional oversampling techniques like SMOTE for mitigating class imbalance in CPDP.
b. Identification of the optimal classification method among the five most commonly utilized algorithms in CPDP.

II. METHOD
The proposed methodology presents a structured approach to designing and implementing trials on the ReLink, NASA MDP, and PROMISE datasets within a computational framework based on Google Colab and Python. FIGURE 1 shows the flowchart used in this study. Within this methodology, one dataset is designated as the target project, while the others serve as source projects. To tackle the CI issues inherent in the datasets, synthetic data is generated using SDV, a tool for developing generative models of relational databases. SDV facilitates data synthesis by selectively sampling across database components after the model is built, ensuring adherence to underlying structural constraints [31]. Moreover, the study uses five classification algorithms, namely DT, LR, KNN, NB, and RF, to conduct a comprehensive assessment of defect prediction effectiveness across multiple projects. This evaluation employs 10-fold cross-validation and uses AUC to measure the performance of each algorithm.

A. DATA COLLECTION
The study employs three datasets: ReLink, NASA MDP, and PROMISE, which are publicly available and widely applied in various domains [4], [19]. Within the ReLink dataset, three projects are featured: Apache, Safe, and Zxing. From the NASA MDP dataset, five of the twelve projects are used, namely CM1, MW1, PC1, PC3, and PC5, chosen because they share the same attributes, thereby eliminating the need for attribute selection for CPDP [18]. From the PROMISE dataset, 11 projects are used: ant-1.7, camel-1.4, ivy-1.1, jedit-4.2, log4j-1.0, lucene-2.4, poi-3.0, synapse-1.2, velocity-1.6, xalan-2.4, and xerces-1.3. Because PROMISE contains multiple versions of each project, a single version is chosen per project, as the distributions of two versions within a project may be highly similar or even identical [32]. The ReLink and NASA MDP datasets are available at https://github.com/bharlow058/AEEEM-andother-SDP-datasets [33], whereas the PROMISE dataset can be obtained from https://github.com/feiwww/PROMISE-backup [34]. TABLE 1 presents information and general statistics about each of the datasets used.

B. PREPROCESSING
In the data preprocessing phase, attributes containing categorical values are converted to nominal values, specifically 0 and 1. For instance, within the ReLink dataset, the attribute 'isDefective' represents 'bug' as 1 and 'clean' as 0. Similarly, in the NASA MDP dataset, the 'Defective' attribute denotes 'Y' as 1 and 'N' as 0. Likewise, within the PROMISE dataset, the 'bug' attribute designates values other than 0 as 1.
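The label conversion described above can be sketched as follows. This is a minimal illustration, not the study's actual preprocessing code: the function name `encode_label` and the dataset keys are hypothetical, and only the value mappings come from the text.

```python
# Value mappings taken from the text; names are illustrative only.
RELINK_MAP = {"bug": 1, "clean": 0}   # 'isDefective' attribute
MDP_MAP = {"Y": 1, "N": 0}            # 'Defective' attribute


def encode_label(dataset: str, raw_value) -> int:
    """Return 1 for a defective module and 0 for a clean one."""
    if dataset == "ReLink":
        return RELINK_MAP[raw_value]
    if dataset == "MDP":
        return MDP_MAP[raw_value]
    if dataset == "PROMISE":
        # 'bug' holds a defect count: any value other than 0 becomes 1.
        return 1 if int(raw_value) != 0 else 0
    raise ValueError(f"unknown dataset: {dataset}")


promise_counts = [0, 3, 1, 0]
promise_labels = [encode_label("PROMISE", c) for c in promise_counts]
```

Raising an error for unknown datasets or unmapped values, rather than defaulting to 0, keeps mislabeled rows from silently inflating the majority class.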

C. OVERSAMPLING WITH SYNTHETIC DATA VAULT
In software defect datasets, non-defective samples typically far outnumber defective ones [31]. CI often biases machine learning models towards the majority class [35]. Given the critical importance of accurately predicting the defective class, addressing CI becomes imperative prior to constructing CPDP models [36], [37]. Synthetic oversampling techniques, such as SMOTE, are employed to address the imbalance by generating artificial minority instances and rebalancing the dataset [38], [39]. However, concerns arise regarding how faithfully conventional oversampling techniques replicate the original dataset [22]. Synthetic data, intentionally manufactured to resemble real-world data, presents a promising strategy to overcome such issues, potentially offering higher quality than directly obtained or measured data [40]. Synthetic data retains a robust set of variables essential for supporting relevant multivariate analyses [41]. In a previous investigation utilizing fMRI images from an open-access database, the efficacy of CI mitigation through synthetic data shaping techniques was found to surpass that of SMOTE [42]. Therefore, SDV, an end-to-end framework for modeling and generating synthetic sequential data tailored for tabular datasets [43], is used to create minority data and address the CI problem in CPDP. These models aim to capture and estimate the correlations and distributions among the variables found within the original dataset [44].
During the initial phase of the SDV redesign process, the system was augmented with two additional libraries to optimize its functionality. Reversible Data Transforms (RDT) were employed by SDV to preprocess tables, which then underwent iterative processing facilitated by Copulas for modeling purposes [45]. Presently, SDV offers various options for modeling single tables, including Copula GAN, CTGAN, Fast ML Preset, Gaussian Copula, and TVAE [46]. Furthermore, SDV consists of interconnected modules, each serving a distinct function. The modules within SDV and their APIs are described below.

META FILE
The first step is the acquisition of the dataset. Following this, SDV requires metadata about the dataset, such as column data types. This information is stored in a JSON structure called the meta file, which is a prerequisite for running SDV procedures on the dataset [45].

DATA LOADER
The CSVDataLoader class is initialized with a meta file supplied as an input parameter. Upon instantiation, this file is stored internally as an attribute named 'meta'. The details provided within the meta file are then used to identify and load the corresponding CSV files as pandas DataFrames.
A dictionary structure is then created, associating each table's name with an instance of the Table class. Each Table instance encapsulates both the metadata and the DataFrame specific to its corresponding table. This information serves as the foundation for the instantiation of a DataNavigator instance, which encapsulates the information and functionality required for navigating through the dataset. Finally, the instantiated DataNavigator is returned by the 'loadData' method for further use [45].

DATA NAVIGATOR
DataNavigator serves as a crucial component for both data navigation and modeling, housing pertinent information regarding the dataset's structure. Its primary functionalities encompass accessing child or parent tables, retrieving data from tables, obtaining table metadata, applying data transformations, and discerning relationships between tables.
A key operation performed by DataNavigator is the get_relationships method, which traces the dataset's structure and stores essential details regarding inter-table relationships, including parent-child associations and primary-foreign key mappings. Such insights are fundamental for the subsequent data modeling [45].

MODELER
The SDV modeling technique utilizes Conditional Parameter Aggregation (CPA) and Recursive Conditional Parameter Aggregation (RCPA) to characterize relationships among tables in a dataset. CPA consolidates conditional parameters within individual tables, while RCPA extends these parameters recursively to all descendant tables, starting from leaf nodes and progressing towards the root node. The modelDatabase function identifies dataset roots, initiates RCPA, and stores the resultant models in the Modeler attribute, enabling efficient modeling of intricate relational structures. The Modeler class can store numerous models and is adaptable to various types of models, such as Copula or others [45].
Let D represent a database comprising numerous tables, denoted as T. The interconnections among these tables are established, thus C(T) signifies the set of children of T, while P(T) denotes the set of parents of T.
The RCPA algorithm can be stated as follows:
1. function RCPA(T)
2.   for all C ∈ C(T) do
3.     RCPA(C)
4.   T ← CPA(T)
5.   T ← PREPROCESS(T)
Since the CPA method returns the extended table, line 4 of the algorithm stores the extended table as T. Subsequently, line 5 preprocesses T to convert the values into numerical data. The base case of this algorithm is for leaf tables, where C(T)=∅. During the creation of the overall model, SDV applies RCPA and uses the result to calculate the database model. SDV's modeling algorithm invokes the RCPA method on all tables without parents. Due to the recursive nature of RCPA, this ensures that all tables in the database ultimately undergo the CPA method [47].
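The recursion order of RCPA can be illustrated with a small sketch. It assumes a toy `Table` class and placeholder `cpa` and `preprocess` functions, all hypothetical; the real CPA fits conditional parameters for child rows, which is not reproduced here. The sketch only shows that children are processed before their parent, so leaf tables form the base case.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy stand-in for a database table (hypothetical, not SDV's class)."""
    name: str
    columns: list = field(default_factory=list)
    children: list = field(default_factory=list)


def cpa(table):
    """Placeholder for CPA: extend the table with one column per child."""
    extended = list(table.columns)
    for child in table.children:
        extended.append(f"params({child.name})")
    return extended


def preprocess(columns):
    """Placeholder for numeric conversion; returns the columns unchanged."""
    return columns


def rcpa(table, visited):
    """Apply CPA to all descendants first, then to the table itself."""
    for child in table.children:           # empty for leaf tables
        rcpa(child, visited)
    table.columns = preprocess(cpa(table))
    visited.append(table.name)
    return visited


transactions = Table("transactions", ["amount"])
sessions = Table("sessions", ["device"], [transactions])
users = Table("users", ["country"], [sessions])
order = rcpa(users, [])
```

Calling `rcpa` on the root table alone is enough to reach every table, which mirrors the statement that the modeling algorithm invokes RCPA only on tables without parents.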

SAMPLER
Following the completion of modeling, the final phase in data synthesis is the sampling of new data. This task is executed by the Sampler class, which is initialized with an instance of the Modeler class. Using the models held by the Modeler, the Sampler generates synthetic data. Its core objective is to offer a range of sampling methods that cater to diverse user requirements. Users need only supply a Modeler instance and a DataNavigator instance to the Sampler in order to invoke the relevant sampling methods. The Sampler maintains all sampled data within a dictionary structure, in which each table's name is associated with the respective sampled rows [45].
From the user's standpoint, SDV involves discrete stages, namely data preparation, modeling, sampling, and evaluation.

DATA PREPARATION
In the data preparation phase, the initial step involves loading the data as a pandas DataFrame object. Subsequently, metadata describing each table is derived from the data using the SingleTableMetadata approach. This metadata encompasses details such as the data type for each column, the primary key, and other pertinent identifiers [46].

MODELING
During the modeling phase, a synthesizer is built from the prepared metadata and trained on the original data. Throughout this stage, the synthesizer learns the underlying patterns within the original dataset. Various synthesizers are utilized in this modeling phase, including Copula GAN, CTGAN, Fast ML Preset, Gaussian Copula, and TVAE [46].
Because the documentation and papers concerning SDV explain each synthesizer only briefly, the following summaries are drawn from various sources.
a) Copula GAN: This hybrid synthesizer integrates classical statistics with GAN-based deep learning techniques, offering a comprehensive approach to data modeling [46]. A GAN has two main components: the discriminator (D) and the generator (G). The discriminator aims to distinguish real data from fake, while the generator tries to produce data that looks real:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
The equation represents a game where the generator minimizes its likelihood of being detected by the discriminator, while the discriminator maximizes its ability to differentiate real from fake. At equilibrium, the generator creates data indistinguishable from real, and the discriminator cannot reliably tell real from fake, achieving a balance where the generated data distribution matches the real data distribution [48].
b) Conditional Tabular GAN (CTGAN): Employs a GAN-based method to model the distribution of tabular data and sample rows from it [49]. CTGAN assesses the dissimilarity between the learned conditional distribution and the real data's conditional distribution [46].
c) Fast ML Preset: It introduces an approach known as indel-coding methodology, where each indel in the input sequence is represented as either present ('1') or absent ('0'). This binary representation is then utilized in a machine learning-based algorithm to estimate the likelihood of gap characters in ancestral sequences. Initially, Fast ML employs a simple coding scheme to convert all indels into binary format, indicating their presence or absence. The resulting binary data matrix serves as input to an ML-based ancestral indel reconstruction algorithm [51]. However, it is important to note that there is no single equation that encapsulates the entirety of a machine learning model [52].
d) Gaussian Copula: Copula models provide an efficient approach to capturing both inter-variable dependencies and individual behaviors. They prove especially valuable for synthesizing datasets from complex, smaller real datasets [53]. The columns of the table are indexed 0, 1, ..., n, and each column has its own Cumulative Distribution Function (CDF), denoted $F_0, \ldots, F_n$ respectively. Each row of the table is treated as a vector $X = (x_0, x_1, \ldots, x_n)$. The Gaussian Copula is then applied to transform the row vector. Mathematically, this transformation can be expressed as:
$$Y = \left[\Phi^{-1}(F_0(x_0)), \Phi^{-1}(F_1(x_1)), \ldots, \Phi^{-1}(F_n(x_n))\right] \tag{2}$$
In this equation, $\Phi^{-1}(F_i(x_i))$ represents the inverse cumulative distribution function of the Gaussian distribution applied to the original distribution [54].
e) Tabular Variational Autoencoder (TVAE): Implementing the Variational Autoencoder (VAE) approach, this synthesizer consists of an encoder for compressing input data into a low-dimensional latent space and a decoder for reconstructing output data based on the learned representation from the encoder [24]. This equation delineates a constraint derived from VAE methodology, which describes the interplay between the latent variable z and the observed variable x. Within the VAE framework, z follows a predetermined prior distribution p(z), typically a standard normal distribution. The choice of likelihood distribution p(x|z) varies depending on the task, being either Normal or Bernoulli. The fundamental objective is to derive the posterior distribution of the latent variable, denoted as p(z|x). However, the true posterior is challenging to compute for continuous latent spaces like z [55], [56].
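The Gaussian Copula transformation in item d) can be sketched with the standard library alone. This is an illustrative approximation, not SDV's implementation: it assumes each column's CDF is estimated empirically as rank/(n+1), a choice made here only to keep values strictly inside (0, 1), and the function names are hypothetical.

```python
from statistics import NormalDist


def empirical_cdf(column):
    """Build F(x) = (number of values <= x) / (n + 1) for one column."""
    values = sorted(column)
    n = len(values)

    def cdf(x):
        rank = sum(1 for v in values if v <= x)
        return rank / (n + 1)

    return cdf


def gaussian_copula_transform(rows):
    """Map every row X to Y = [Phi^-1(F_i(x_i)) for each column i]."""
    columns = list(zip(*rows))                 # column-wise view of the table
    cdfs = [empirical_cdf(col) for col in columns]
    std_normal = NormalDist()                  # mean 0, standard deviation 1
    return [
        [std_normal.inv_cdf(cdf(x)) for cdf, x in zip(cdfs, row)]
        for row in rows
    ]


table = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
transformed = gaussian_copula_transform(table)
```

After the transformation every column is approximately standard normal, so the dependence between columns can be summarized by a single covariance matrix, which is what makes the copula approach compact.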

SAMPLING
Once modeling is complete, the synthesizer can produce synthetic data. In this study, the generated synthetic data specifically targets the minority class, addressing the issue of data imbalance [46].

DIAGNOSTIC
The Diagnostic Report performs fundamental checks on data format and validity. Specifically, it applies the TableStructure metric to each table in the dataset to ensure consistency. This metric compares the column names between the synthetic and real data. By identifying all column names in both datasets, it calculates a score based on the overlap between the columns:
$$\text{score} = \frac{|r \cap s|}{|r \cup s|} \tag{4}$$
where r is the set of column names in the real data and s is the set of column names in the synthetic data. A score of 100% indicates perfect alignment, meaning the synthetic data shares identical column names with the real data [46].
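The column-overlap score can be computed directly from the two sets of column names. The sketch below assumes the score is the plain ratio of shared to total column names, as described above; the function name and example columns are hypothetical and this is not the SDV diagnostic itself.

```python
def column_overlap_score(real_columns, synthetic_columns):
    """Ratio of shared column names to all column names (1.0 = identical)."""
    real, synthetic = set(real_columns), set(synthetic_columns)
    if not real and not synthetic:
        return 1.0  # two empty tables trivially share the same structure
    return len(real & synthetic) / len(real | synthetic)


identical = column_overlap_score(["loc", "wmc", "bug"], ["bug", "loc", "wmc"])
missing_one = column_overlap_score(["loc", "wmc", "bug"], ["loc", "wmc"])
```

In the second call the synthetic table lacks the 'bug' column, so two of three names overlap and the score drops below 100%.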

C. SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE
The Synthetic Minority Over-Sampling Technique (SMOTE) is employed as an oversampling method to mitigate CI in datasets [57]. This technique leverages original samples from the minority class to generate new synthetic instances. Unlike traditional data space approaches, SMOTE operates in feature space when synthesizing instances [26]. In this study, the results obtained with SMOTE are compared with those obtained from SDV and from the unbalanced datasets. This comparison enables a comprehensive evaluation of SDV's efficacy in addressing dataset imbalance. The equation of SMOTE is represented as follows:
$$x_{new} = x + \mathrm{rand}(0,1) \times (x_{[k]} - x) \tag{5}$$
This equation generates a new synthetic sample, denoted as $x_{new}$, by linearly blending between an original sample, $x$, and another sample, $x_{[k]}$. The degree of blending is determined by a random factor, rand(0,1), which scales the difference between $x$ and $x_{[k]}$. This random factor introduces variability into the process of generating the synthetic sample [58].
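The interpolation step of SMOTE can be sketched as follows. This minimal example covers only the blending between a minority sample and one other minority sample; selecting that second sample among the k nearest minority neighbors, as SMOTE usually does, is omitted. The function name and the example vectors are hypothetical.

```python
import random


def smote_interpolate(x, neighbor, rng):
    """Return x + gap * (neighbor - x) with gap drawn uniformly from [0, 1)."""
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)          # fixed seed so the sketch is repeatable
minority_sample = [1.0, 2.0]
minority_neighbor = [3.0, 6.0]
synthetic = smote_interpolate(minority_sample, minority_neighbor, rng)
```

Because a single random factor is shared by all features, the synthetic point always lies on the line segment between the two original samples.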

D. CLASSIFICATION ALGORITHM
In recent years, researchers have increasingly focused on the classification stage, which represents the final phase of prediction models. This stage has been the subject of intense scrutiny aimed at enhancing the efficiency of CPDP models and improving classifier performance. Accordingly, this study adopts the five most prevalent classification methods utilized in CPDP [30].

K-NEAREST NEIGHBORS
The K-Nearest Neighbors (KNN) algorithm is highly regarded for its versatility, as it does not impose stringent assumptions regarding the underlying data distribution. KNN leverages the proximity of data points and makes decisions based on the majority class among the nearest neighbors, a methodology that frequently yields favorable outcomes across diverse datasets [59]. KNN classifies a new data point by examining the predominant class among its nearest neighbors within a predefined neighborhood size, denoted as the K value. The Euclidean distance is the fundamental formulation used in KNN, represented as:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{6}$$
In this equation, $x_i$ and $y_i$ represent elements of the feature vectors x and y from sets A and B, respectively. The variable n denotes the dimensionality of the feature space, that is, the number of features considered in the comparison [60].
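A compact sketch of KNN with Euclidean distance is given below. It is an illustration under simple assumptions (small in-memory lists, majority vote, no distance weighting), not the classifier configuration used in the study; the toy training points are hypothetical.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Square root of the summed squared feature differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_x, train_y, query, k=3):
    """Label the query with the majority class of its k nearest neighbors."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
train_y = [0, 0, 0, 1, 1]            # 0 = clean, 1 = defective
near_clean = knn_predict(train_x, train_y, [0.2, 0.2], k=3)
near_defective = knn_predict(train_x, train_y, [5.5, 5.0], k=3)
```

With two classes, an odd K avoids tied votes. The second query also shows why class balance matters for KNN: only two defective neighbors exist, so any K above 3 would let the majority class outvote them.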

NAIVE BAYES
Naive Bayes (NB) is a probabilistic machine learning technique employed for classification tasks [18]. It determines the highest probability value and assigns the test data to the most suitable category based on this calculation. The classifier derives its name from the "naive" assumption that all features are independent of each other given a class label. While this assumption is often violated in real-world contexts, Naive Bayes classifiers can still yield satisfactory outcomes in numerous scenarios [62]. This simplicity and robust classification performance contribute to NB being widely adopted as a classification algorithm [63]. The equation of NB is represented as:
$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \tag{7}$$
In this equation, X represents data with an unknown class, while H stands for the hypothesis that X belongs to a specific class. The term $P(H \mid X)$ denotes the probability of hypothesis H given the data X, known as the posterior probability. $P(H)$ represents the prior probability of hypothesis H, while $P(X \mid H)$ signifies the probability of observing data X given hypothesis H. Finally, $P(X)$ represents the overall probability of observing data X [64].
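A short numeric example may help make Bayes' rule concrete. The probabilities below are invented purely for illustration and are not taken from the study's datasets; the function name is hypothetical.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' rule: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical numbers: 20% of modules are defective (H), a high-complexity
# flag (X) appears in 60% of defective and 10% of clean modules.
p_h = 0.2
p_x_given_h = 0.6
p_x_given_not_h = 0.1
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # total probability
p_h_given_x = posterior(p_h, p_x_given_h, p_x)
```

Under these assumed numbers, observing the flag raises the probability of a defect from the 20% prior to a 60% posterior.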

DECISION TREE
The Decision Tree (DT) classifier is a computational model known for its multistage decision-making process, and it handles both numerical and nominal data types. Its hierarchical structure comprises two types of nodes, decision nodes and leaf nodes, which together form efficient decision rules [65]. Decision nodes establish decision rules by segmenting the data into different sections based on specific criteria. Leaf nodes represent the final outcomes derived from these decision rules and do not lead to further subdivisions or branches. Thus, while decision nodes steer the tree's structure, leaf nodes furnish the final decisions or predictions [66]. The entropy equation serves as a pivotal tool in DT analysis, particularly when calculating the impurity at a node, represented as:
$$\mathrm{Entropy}(P) = -\sum_{i} p(i) \log_2 p(i) \tag{8}$$
In this equation, $\mathrm{Entropy}(P)$ represents the entropy of the dataset P, where $p(i)$ denotes the probability that an instance in dataset P belongs to class i [67].
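The entropy of a node can be computed from its class labels as sketched below. The function name and the example label lists are hypothetical; class probabilities are estimated as relative frequencies.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the class distribution found in `labels`."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


balanced_node = entropy([0, 0, 1, 1])      # evenly mixed classes
pure_node = entropy([1, 1, 1])             # a single class
imbalanced_node = entropy([0, 0, 0, 1])    # 75% clean, 25% defective
```

A pure node has zero entropy and an evenly split two-class node has the maximum of one bit, so a split is attractive when it moves child nodes towards zero.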

RANDOM FOREST
The Random Forest (RF) algorithm is a supervised classification technique that creates a forest through a randomized procedure [68]. Initially, it identifies the root node using the most effective split. This process is then repeated for the child nodes, using the same optimal splitting method. Iterating in this way constructs a complete tree, with the desired outcome at the leaf nodes. Subsequently, the algorithm repeats these steps to generate multiple trees, each with its own random selection of features and splits [69]. RF execution involves a structured process. It begins with bootstrapping, where samples of size n are drawn randomly with replacement from the dataset. DT are then grown without pruning until reaching maximum size, using these bootstrap samples. At each node, a split is chosen from a randomly selected subset of m predictors out of the total p predictors (where m << p), known as the random feature selection phase. This process repeats k times, creating a forest of k trees [70].
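The two sources of randomness in RF, bootstrapping and random feature selection, can be sketched as below. The example only plans the random draws for k trees and does not grow any tree; for brevity it draws the feature subset for the root node of each tree, whereas RF repeats that draw at every node. All names and sizes are hypothetical.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(p, m, rng):
    """Pick m of the p predictors, without replacement, for one node split."""
    return rng.sample(range(p), m)


def plan_forest(n_rows, p, m, k, seed=0):
    """Return the random draws that would seed each of the k trees."""
    rng = random.Random(seed)
    return [
        {"rows": bootstrap_sample(n_rows, rng),
         "root_features": candidate_features(p, m, rng)}
        for _ in range(k)
    ]


forest_plan = plan_forest(n_rows=10, p=20, m=4, k=5)
```

Because rows are drawn with replacement, each tree typically sees some rows several times and others not at all, which is what decorrelates the trees.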

LOGISTIC REGRESSION
Logistic Regression (LR) is a versatile predictive modeling technique extensively utilized to assess the relationship between dependent (target) variables, typically categorical data with nominal or ordinal scales, and independent (predictor) variables [71]. It is a prominent statistical method for constructing predictive models, particularly for estimating the probability of an event [72]. LR is specifically tailored for making categorical predictions, handling binary or multinomial outcomes by modeling the probability of belonging to a specific category. It achieves this by employing a logistic function to transform the output of a linear model into probabilities, ensuring predictions fall within the range of 0 to 1 [73]. The equation of LR is represented as:
$$P(y_j = 1) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_{ij}\right)}} \tag{9}$$
In this equation, $\beta_i$ represents the slope of independent attribute i, $\beta_0$ is the intercept, and $x_{ij}$ signifies independent attribute i in record j. The variable n denotes the number of independent attributes, and j indexes the records in the dataset [70].
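How the logistic function turns a linear score into a probability can be sketched as follows. The sketch assumes the common sigmoid form with an intercept; the coefficient values are invented for illustration and the function name is hypothetical.

```python
import math


def predict_probability(intercept, slopes, features):
    """Sigmoid of the linear score intercept + sum(slope_i * feature_i)."""
    score = intercept + sum(b * x for b, x in zip(slopes, features))
    return 1.0 / (1.0 + math.exp(-score))


neutral = predict_probability(0.0, [0.0, 0.0], [12.0, 3.0])    # score = 0
likely = predict_probability(-1.0, [0.5, 1.0], [6.0, 3.0])     # score = 5
unlikely = predict_probability(-1.0, [0.5, 1.0], [0.0, -4.0])  # score = -5
```

A score of zero maps to a probability of 0.5, and the output approaches 1 or 0 as the score grows large and positive or negative, so every prediction stays within the range of 0 to 1.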

E. PERFORMANCE EVALUATION
Model performance evaluation is a crucial aspect of this study [59], primarily focusing on the AUC, which is important for evaluating the effectiveness of data categorization [74]. AUC provides a quantitative measure of the model's ability to distinguish between different classes, with values ranging from 0 to 1. A value of 1 indicates perfect separation between classes, while a value of 0.5 suggests random categorization [75]. Analyzing AUC values provides valuable insights into the discriminatory power of the model and its performance in accurately classifying instances. The equation of AUC is represented as:
$$AUC = \int_{0}^{1} TPR(t)\, d\big(FPR(t)\big) \tag{10}$$
In this equation, the AUC represents the integral of the True Positive Rate (TPR) plotted against the False Positive Rate (FPR), where t signifies various classification thresholds [76].
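AUC can be computed without tracing the ROC curve explicitly, because the area under the curve equals the fraction of (defective, clean) pairs in which the defective instance receives the higher score, with ties counted as one half. The sketch below uses that pairwise form; the function name and scores are hypothetical, and a library routine would normally be preferred for large datasets.

```python
def auc_score(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(labels, model_scores)
perfect = auc_score([0, 1], [0.2, 0.9])
```

In the first example three of the four positive/negative pairs are ordered correctly. Because the measure depends only on ranking, it is not inflated by a classifier that simply predicts the majority class, which suits imbalanced defect data.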

F. T-TEST
This test focuses on the difference in AUC values to evaluate the average performance of the models and determine whether the difference is significant [78]. The significance level alpha (α) is set at 0.05, a common choice that corresponds to a 95% confidence level when rejecting the null hypothesis. A t-Test result below this threshold indicates statistical significance. While alpha levels can be adjusted, 0.05 is generally accepted as a practical compromise [79]. The equation of the t-Test is represented as:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \tag{11}$$
In this equation, $\bar{x}_1$ and $\bar{x}_2$ represent the mean values from groups 1 and 2, respectively. $s_p$ stands for an estimate of the pooled standard deviation of the measurements. Additionally, $n_1$ and $n_2$ denote the number of observations for each group [58].
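The t statistic with a pooled standard deviation can be sketched with the standard library. The sketch assumes two independent groups with equal variances and computes only the statistic; obtaining the p-value requires the t distribution, which the standard library does not provide. The AUC lists are invented for illustration.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group1, group2):
    """Two-sample t statistic using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    return (mean(group1) - mean(group2)) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical AUC values for two balancing methods over three projects.
auc_method_a = [0.70, 0.72, 0.74]
auc_method_b = [0.62, 0.64, 0.66]
t_value = pooled_t_statistic(auc_method_a, auc_method_b)
```

If SciPy is available, `scipy.stats.ttest_ind` with its default equal-variance setting returns the same statistic together with the p-value that is compared against α = 0.05.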

III. RESULT
This study assesses the efficacy of synthetic data generated through SDV in tackling the persistent challenge of CI within CPDP. Through a comparative investigation, we analyze the performance of SDV-generated synthetic datasets against those produced by the widely adopted SMOTE technique.
Drawing on data from 19 projects, our research examines how well synthetic data addresses CI challenges within CPDP. TABLE 3-5 present the empirical evidence, which establishes the superiority of SDV-generated synthetic data over both the original imbalanced datasets and those balanced by the SMOTE technique.
Moreover, the study reports results for five classification algorithms, each evaluated with AUC, which strengthens the robustness of the findings. Using multiple classifiers provides a more comprehensive understanding of the performance of synthetic data generated through SDV and SMOTE.
These findings underscore the substantial potential of SDV synthetic data in rectifying CI issues and its impact on predictive modeling within CPDP. By offering statistically superior outcomes compared to conventional methods like SMOTE, our study points to a promising direction for CI strategies within CPDP.
During the initial validation stage, one project was designated as the testing dataset, while the other projects were used as the training datasets. In each subsequent validation stage, the dataset from the next project was chosen as the test data, with datasets from the remaining projects employed as the training data. This iterative process continued until all projects had been used as testing datasets.
SDV exhibits superior performance compared to SMOTE in handling imbalanced datasets, consistently outperforming both the original imbalanced datasets and those balanced using SMOTE across the classification algorithms and datasets examined. However, SDV techniques demand substantial computational resources and expertise to generate high-quality synthetic data that accurately represents the underlying distribution. In contrast, SMOTE is simpler and less resource-intensive but may produce synthetic samples that are sensitive to noise and outliers, potentially leading to overfitting or decreased model performance. Imbalanced datasets reflect real-world scenarios, yet their inherent bias can cause classifiers to favor the majority class, resulting in suboptimal predictive performance for minority classes. Therefore, while imbalanced datasets remain representative of practical applications, employing SDV or SMOTE requires careful consideration of computational requirements and potential impacts on model generalization.
FIGURE 2 depicts a graph comparing the average results of the proposed method with those of other methods across different datasets. Each dataset is represented as a cluster of bars along the x-axis, with each bar showing the mean outcome of one method on the corresponding dataset.
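The iterative validation scheme described above, in which each project serves once as the test set while the remaining projects form the training set, can be sketched as follows. The project names, feature rows, and the function name are hypothetical placeholders; the balancing and classification steps are only marked by comments, since their exact placement in the study's pipeline is an assumption here.

```python
def leave_one_project_out(projects):
    """Yield (test_name, train_rows, test_rows), using each project once as the test set."""
    for test_name, test_rows in projects.items():
        # Pool the rows of every other project into a single training set.
        train_rows = [
            row
            for name, rows in projects.items()
            if name != test_name
            for row in rows
        ]
        yield test_name, train_rows, test_rows


# Hypothetical projects: each row is (feature_vector, label), with label 1 = defective.
projects = {
    "project_a": [([10, 2], 0), ([35, 7], 1), ([12, 1], 0)],
    "project_b": [([20, 3], 0), ([50, 9], 1)],
    "project_c": [([8, 1], 0), ([15, 2], 0), ([40, 6], 1), ([9, 1], 0)],
}

folds = list(leave_one_project_out(projects))
for test_name, train_rows, test_rows in folds:
    # Assumed pipeline: balance train_rows (e.g. with SDV or SMOTE), fit a classifier,
    # then compute the AUC on the untouched test_rows.
    print(test_name, len(train_rows), len(test_rows))
```

Keeping the test project untouched in each fold means the reported AUC reflects performance on data from a project the classifier has not seen.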
After obtaining the average AUC results for each project, we conducted a significance test using the t-Test to ascertain whether the improvement achieved by our proposed method over the other methods was statistically significant. If the p-value falls below the alpha threshold of 0.05, the observed performance improvement is considered statistically significant. Conversely, if the p-value exceeds the alpha threshold, the observed performance improvement is considered statistically non-significant. In this study, the t-Test results indicate significant differences between the performance of the methods in addressing CI. For instance, when the proposed method is compared against SMOTE and the imbalanced data using five classifiers across the datasets, the p-values are consistently low. This indicates that the proposed method yields statistically significant improvements over the others.
Moreover, the significance levels vary across datasets and algorithms. For instance, in the MDP dataset, the p-values for all method comparisons are extremely low, suggesting highly significant differences. In the PROMISE and ReLink datasets, most comparisons still yield low p-values, indicating significance, but there are instances where the p-values are slightly higher. This variability underscores the importance of considering dataset-specific characteristics when evaluating the effectiveness of different methods.
Overall, the t-Test results, evaluated against the alpha value of 0.05, provide strong evidence to support the superiority of the proposed method in addressing CI compared to traditional approaches across multiple datasets.

IV. DISCUSSION
The study substantiates the efficacy of our proposed methodology in addressing the challenge of CI within CPDP. By using synthetic data generation via SDV to rectify CI, our approach demonstrates superior performance across the five classifiers used. Notably, it surpasses both SMOTE and the original imbalanced data, underscoring its robustness and effectiveness.
A comparative analysis reveals the consistent outperformance of our approach over SMOTE across the datasets, as detailed in TABLES 3-5. Furthermore, using the t-Test, as outlined in TABLE 6, we establish that the improvement of our SDV approach over SMOTE is statistically significant across all evaluated datasets.
The synthetic data generated through SDV consistently exhibits superior performance, most clearly in terms of the AUC metric. Notably, the consistent superiority of the KNN algorithm underscores the importance of classifier selection in addressing the challenges associated with CPDP.
While previous studies have explored an array of techniques to deal with CI, our investigation empirically substantiates that SDV presents a more effective solution within this domain. This assertion is further supported by the comparative analysis presented in TABLE 7, which shows that our method achieves higher AUC than alternative techniques.
TABLE 7
Dataset     Method                      AUC
…
            … [13]                      0.633
            S-DCCA [19]                 0.632
NASA MDP    Proposed Method with KNN    0.704
            AD-SMOTE [18]               0.664
PROMISE     Proposed Method with KNN    0.750
            GSMOTE-NFM [80]             0.715
            SMOTE-GB [17]               0.649
            SMOTE-RF [17]               0.667
However, we acknowledge the inherent constraints of our study. The reliance on specific datasets limits the generalizability of our findings, while the focus on the AUC metric and a select set of classification algorithms may overshadow other relevant facets of model performance assessment. Furthermore, the computational demands of the SDV technique pose practical challenges in real-world deployment, warranting further exploration and refinement. The study also does not rule out the possibility that data generated by SDV may lead to overfitting. This consideration underscores the need for caution in extending the application of SDV-generated data beyond the scope of this study.
Nevertheless, our study yields findings with important implications for the field of CPDP. Academically, we offer insights into enhancing the reliability and precision of defect prediction models by demonstrating the efficacy of synthetic data generation through SDV. The implementation of SDV techniques can lead to more precise and reliable forecasts of software defects, thereby improving the quality and dependability of software products.
Moreover, the integration of SDV holds promise for streamlining the development lifecycle, reducing maintenance expenditures, and ultimately raising customer satisfaction. Additionally, synthetic data can help safeguard sensitive personal information that cannot be divulged, thereby supporting compliance with data privacy regulations.
Furthermore, we advocate for the exploration of alternative methodologies within SDP frameworks to mitigate CI beyond traditional techniques such as SMOTE. In essence, our research underscores the potential of our proposed methodology for CPDP. By mitigating CI through SDV, our approach produces predictive models that outperform existing methodologies on the evaluated datasets, thereby offering a roadmap for future research aimed at improving the efficacy and applicability of defect prediction models.

V. CONCLUSION
This research aims to tackle a common challenge in CPDP, namely CI, by leveraging synthetic data generated by SDV. SDV balances the data by generating synthetic minority-class instances, thereby achieving a balanced distribution of instances across classes. Using five different classification algorithms and the AUC metric, this study investigated the performance of synthetic data generated by SDV compared to traditional methods like SMOTE across 19 selected projects.
Our study demonstrates the superiority of synthetic data generated by SDV in addressing CI within CPDP. Across all analyzed datasets, SDV consistently outperformed both the original imbalanced datasets and those balanced using SMOTE, as evidenced by higher AUC scores. Specifically, improvements were observed on all three dataset groups: ReLink, NASA MDP, and PROMISE. KNN achieved AUC scores of 0.695, 0.704, and 0.750 for the respective datasets, while DT attained scores of 0.655, 0.623, and 0.610. LR yielded AUC scores of 0.612, 0.644, and 0.633, whereas NB obtained scores of 0.626, 0.581, and 0.638. RF achieved AUC scores of 0.651, 0.607, and 0.634. These results confirm that the use of synthetic data from SDV significantly enhances model performance in addressing CI in CPDP.
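The per-classifier AUC scores reported above can be averaged across the three dataset groups to check the relative ordering of the classifiers. The sketch below uses only the values stated in this section; the variable names are illustrative, and the unweighted mean over the three dataset groups is a simplifying assumption rather than the averaging used in the study.

```python
# AUC scores with SDV as reported in this section, ordered as ReLink, NASA MDP, PROMISE.
reported_auc = {
    "KNN": [0.695, 0.704, 0.750],
    "DT": [0.655, 0.623, 0.610],
    "LR": [0.612, 0.644, 0.633],
    "NB": [0.626, 0.581, 0.638],
    "RF": [0.651, 0.607, 0.634],
}

# Unweighted mean across the three dataset groups (an assumption made for illustration).
mean_auc = {clf: sum(values) / len(values) for clf, values in reported_auc.items()}

# Rank classifiers from highest to lowest mean AUC.
ranking = sorted(mean_auc, key=mean_auc.get, reverse=True)
print(ranking)  # ['KNN', 'RF', 'LR', 'DT', 'NB']
```

Under this simple averaging, KNN leads by a clear margin, RF, LR, and DT are close together, and NB comes last.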
To address the limitations identified in this study, future research could explore the application of SDV techniques across a broader range of datasets and project contexts to enhance generalizability. Additionally, investigating the performance of SDV in conjunction with other machine learning techniques and performance measures could provide a more comprehensive understanding of its capabilities. Moreover, efforts to reduce the computational overhead associated with SDV could facilitate its adoption in real-world CPDP scenarios, and further work is needed to address the risk of overfitting.

FIGURE 2. Performance Comparison of the Proposed Method and Others

TABLE 2 Recursive Application of CPA to add Derived Columns to T
The conditional vector is denoted as $D_{i^*} = k^*$, where $k = 1, \dots, |D_i|$, and $d_{i,j}$, a discrete variable in $D_i$, is initially represented as a one-hot vector $d_i$ with dimension $|D_i|$. During training, the conditional generator is permitted to generate any set of one-hot discrete vectors.