Feature Selection Using Firefly Algorithm with Tree-Based Classification in Software Defect Prediction

Defects are a universal occurrence in software products. Software defect prediction, a Software Engineering topic of great concern to researchers, is typically studied by measuring the accuracy, precision, and overall performance of a prediction model or method on various kinds of datasets. The purpose of this research is to improve the performance of the Decision Tree, Random Forest, and Deep Forest classification methods in predicting software defects by applying Firefly feature selection, and to identify which tree-based classification algorithm combined with Firefly feature selection provides the best prediction performance. The dataset used in this study is the ReLink dataset, which consists of Apache, Safe, and Zxing. The data is divided into training data and testing data with 10-fold cross validation, and feature selection is then performed using the Firefly Algorithm. Each ReLink dataset is processed by each tree-based classification algorithm (Decision Tree, Random Forest, and Deep Forest) using the features chosen by Firefly feature selection. Performance is evaluated using the AUC (Area Under the ROC Curve) value. The research was conducted using Google Colab. The average AUC value is 0.66 for Firefly-Decision Tree, 0.77 for Firefly-Random Forest, and 0.76 for Firefly-Deep Forest. These results indicate that the Firefly algorithm combined with Random Forest classification gives better results than the other tree-based algorithms.


I. INTRODUCTION
Software systems continue to serve important functions in every aspect of our society, and the presence of a flaw in such a system can have a major impact on the economy and the general population [1]. Software development projects require a software testing phase, which is of utmost importance and incurs significant costs for investigating the efficacy of the resulting product [2]. A software defect denotes a flaw, error, bug, mistake, fault, or failure within a computer system or program that may produce an unexpected or inaccurate outcome or hinder intended software behavior [3]. To attain high-quality software, the final product must have minimal defects. Early detection of software defects can lead to reduced development costs, less rework, and more dependable software [4]. Defect prediction is an exceedingly dynamic domain within software analytics [5]. Software defect prediction metrics are of paramount significance in developing a prediction model whose objective is to enhance software quality by foreseeing as many software defects as possible [6].

Andini et al. [7] used tree-based classification with hyperparameter tuning and obtained an average AUC value of 0.69 for Decision Tree, 0.76 for Random Forest, and 0.79 for Deep Forest. In another study, Anbu et al. used the Firefly optimization method as feature selection to improve software defect prediction performance. That study concluded that the Firefly search algorithm is effective for feature selection problems: SVM with feature selection (FS) achieved classification accuracy that was better by 4.53% than SVM without FS, by 5.4% than KNN without FS, and by 11% than NB without FS. A previous study by Zhou et al. considered several methods, such as Deep Belief Networks (DBN), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), and Support Vector Machine (SVM), in predicting software defects. The data used were the NASA, PROMISE, AEEEM, and ReLink datasets. Based on the comparison results, DPDF had the best performance on the NASA dataset, where the AUC increased and the highest value was 92%. On the PROMISE and AEEEM datasets, the DPDF results were also better than the others, with highest scores of 89% and 86%. On the ReLink datasets, however, DPDF did not outperform RF and DBN, and the highest score was 75%.

Feature selection plays a crucial role in many applications because it is indispensable for ensuring generalization, performance, computational efficiency, and feature interpretability [8]. This study therefore applies feature selection using the Firefly algorithm to software defect prediction with tree-based classification, namely Decision Tree, Random Forest, and Deep Forest, with the aim of improving the resulting performance.

II. METHOD
This section describes the dataset used, the Decision Tree, Random Forest, and Deep Forest algorithms, the Firefly algorithm, validation using cross validation, and performance measurement using AUC. Figure 1 shows the flow of this research.

FIGURE 1. Research flow with feature selection
Figure 1 presents the research flow as a flowchart. The initial step is collecting the ReLink dataset, followed by splitting the data using cross validation. The validation technique adopted in this study is 10-fold cross validation: each ReLink dataset is partitioned into 10 sections, with 9 sections designated as training data and the remaining section used as test data. Subsequently, feature selection is executed via the Firefly algorithm before classification. The classification phase involves three scenarios: Decision Tree, Random Forest, and Deep Forest. The evaluation uses the average AUC value. The experiments were conducted in Python using Google Colaboratory.

A. DATA COLLECTION
The dataset used in this study is a software metrics dataset called ReLink, which consists of Apache, Safe, and Zxing data. This dataset can be downloaded at the following link: https://github.com/bharlow058/AEEEM-and-otherSDPdatasets/tree/master/dataset/Relink. TABLE 1 shows the number of instances in each ReLink dataset, namely 194 for Apache, 56 for Safe, and 399 for Zxing. It also describes the ReLink software metrics, which are grouped into two software metric categories: Complexity Metric (CPM) and Count Metric (CTM). All ReLink datasets have the same number of software metrics [7].

B. DATA SPLITTING 1. CROSS VALIDATION
Cross validation reduces the bias that arises from random sampling of datasets [9]. It divides the original data into training data and testing data [4] by randomly dividing the dataset into K parts [10]. One part is used to validate the model and the rest to train the classifier. This process is repeated K times, selecting a different validation subset each time. A weakness of K-fold cross validation appears with unbalanced data, where there is a possibility that some data is lost and that only a few instances are tested, so that many remain untested [11].
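The K-fold procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `k_fold_indices` and the `seed` parameter are hypothetical, and only the standard library is used.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division into K parts
    # Spread samples as evenly as possible: the first n % k folds get one extra.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                # one part validates
        train_idx = indices[:start] + indices[start + size:]  # the rest trains
        yield train_idx, test_idx
        start += size

# Example using the Apache dataset size reported in this study (194 instances).
folds = list(k_fold_indices(194, k=10))
```

Every instance appears in exactly one test fold. A stratified variant, which keeps the class ratio similar in every fold, is a common way to soften the unbalanced-data weakness noted above, although the source does not state whether stratification was used.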

C. FEATURE SELECTION 1. FIREFLY ALGORITHM BASED FEATURE SELECTION
Feature selection constitutes a combinatorial optimization problem [12]. The Firefly algorithm (FA) is a population-based meta-heuristic algorithm that performs well on a wide range of optimization problems [13]. It draws inspiration from the light-flashing behavior of real fireflies: a randomly generated solution is treated as a Firefly, and the brightness assigned to it depends on its performance on the objective function [14]. Every Firefly has a unique position, determined by the numbers generated for it [15]. The Firefly Algorithm has been used to select discriminative features for classification and regression models that support decision-making with database-based learning methods [16], and it has proven highly successful despite its relatively low cost [17]. The brightness of a Firefly is determined by evaluating the fitness function. For a maximization problem, brightness can be taken as proportional to the value of the objective function (fitness function) [18]. The Firefly algorithm shows a superior capacity to avoid being trapped in local optima, together with marked improvements in both convergence speed and solution precision [19].
The attractiveness of a Firefly is determined by its brightness, which depends on the light intensity. The attractiveness of each Firefly is calculated using Equation (1) [14]:

$$\beta(r) = \beta_0 e^{-\gamma r^2} \tag{1}$$

where β0 denotes the attractiveness at distance r = 0, which in some cases is set to one for computational convenience, and γ is the light absorption coefficient. Here r denotes the distance between two fireflies i and j, which move from one position to another. Because the attractiveness between two fireflies depends on the distance that separates them, this distance is computed as the Euclidean distance [14] using Equation (2):

$$r_{ij} = \lVert x_i - x_j \rVert = \sqrt{\sum_{k=1}^{d} \left(x_{i,k} - x_{j,k}\right)^2} \tag{2}$$

where d denotes the dimension of the given problem and x_{i,k} is the k-th component of the position of Firefly i. After the distance between two fireflies has been calculated, if Firefly i is less bright than Firefly j, then Firefly i moves towards Firefly j. This movement is governed by Equation (3) [14]:

$$x_i^{t+1} = x_i^{t} + \beta_0 e^{-\gamma r_{ij}^2} \left(x_j^{t} - x_i^{t}\right) + \alpha \left(\mathrm{rand} - \tfrac{1}{2}\right) \tag{3}$$

where t denotes the iteration number, the coefficient α governs the magnitude of the random walk, and rand is a random number drawn from the interval [0,1]. The less bright Firefly moves towards the brighter Firefly according to three terms [14]. The first term is the current position of the less bright Firefly. The second term is the movement towards the brighter Firefly, which is guided by the attraction coefficient β. The last term is a random walk computed from a random number generator and scaled by α.
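The attractiveness, distance, and movement rules described above can be sketched as follows. This is a minimal sketch assuming the standard Firefly formulation (attractiveness decaying as β0·exp(−γr²) and a random step of α·(rand − 0.5)); the default parameter values and the threshold-based `to_feature_mask` encoding are hypothetical choices for illustration, since the source does not state how positions are mapped to feature subsets.

```python
import math
import random

def distance(xi, xj):
    """Euclidean distance between two firefly positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def attractiveness(r, beta0=1.0, gamma=1.0):
    """Attractiveness at distance r; equals beta0 when r = 0."""
    return beta0 * math.exp(-gamma * r ** 2)

def move_towards(xi, xj, alpha=0.2, beta0=1.0, gamma=1.0, rng=random):
    """Move the dimmer firefly xi towards the brighter firefly xj."""
    beta = attractiveness(distance(xi, xj), beta0, gamma)
    # current position + attraction towards xj + scaled random walk
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(xi, xj)]

def to_feature_mask(position, threshold=0.5):
    """Hypothetical encoding: keep feature k when component k exceeds the threshold."""
    return [component > threshold for component in position]
```

In a feature selection setting, the brightness of each firefly would typically be the score (for example the AUC) of a classifier trained on the features kept by its mask, so brighter fireflies correspond to better feature subsets.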

D. CLASSIFICATION 1. DECISION TREE CLASSIFICATION
A Decision Tree (DT) is a classification technique used in data mining that constructs a model in a top-down, tree-like fashion, based on the attributes of a given dataset [20]. The Decision Tree classification method can solve both binary and multi-class classification problems [21]. Like an ordinary tree, the Decision Tree comprises a root, branches, and leaves [22]. The essence of DT lies in its hierarchical, predictive modeling strategy, in which observations about an item serve as branches that lead to the item's target value in a leaf [23].

FIGURE 2. Structure of a Decision Tree [24]
This means that it is a directed tree with a node called the "root," which has no incoming edges, while every other node has exactly one incoming edge. A node with outgoing edges is referred to as an internal or test node. All other nodes are called leaves, also known as terminal or decision nodes. Each leaf node is linked to a class label. The Decision Tree is constructed from the training set [24]. A Decision Tree of this nature is depicted in FIGURE 2.
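The root, internal (test) node, and leaf structure described above can be illustrated with a small hand-built tree. This is only a sketch: the `Node` class, the metric names `lines_of_code` and `cyclomatic_complexity`, and the thresholds are hypothetical and are not taken from the ReLink data or from a trained model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal (test) nodes set feature and threshold; leaves set only label.
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # branch taken when value <= threshold
    right: Optional["Node"] = None   # branch taken when value > threshold
    label: Optional[str] = None

def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's class label."""
    while node.label is None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.label

# Hypothetical tree over two made-up software metrics.
root = Node(
    feature="lines_of_code", threshold=200.0,
    left=Node(label="clean"),
    right=Node(feature="cyclomatic_complexity", threshold=10.0,
               left=Node(label="clean"), right=Node(label="defective")),
)
```

Calling `predict(root, sample)` follows exactly one path from the root to a leaf, which is why each leaf can be read as a simple conjunction of the tests along that path.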

2. RANDOM FOREST CLASSIFICATION
As its name indicates, the Random Forest (RF) algorithm is a supervised classification algorithm that creates a forest through a random process. The number of trees in the forest directly affects the accuracy of the outcome, with a larger number of trees generally giving more precise results [25]. Random Forest classification is performed by taking the majority class vote over the votes of the individual trees [26]. One important benefit of RF is that individual trees do not need to be pruned, because the forest contains many trees. The disadvantage is that the large number of trees makes the model difficult to visualize [27]. The method rests on two primary principles: row sampling and a voting classifier. The provided records are resampled and then passed to the base learner models for training. Aggregation is the voting classifier concept, in which the output for a test instance is the class with the most votes from the base learner models [28]. A generalized model of the Random Forest is depicted in FIGURE 3.
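The two principles named above, row sampling and voting, can be sketched as follows. This is an illustrative sketch only: the base learners are hypothetical hand-written rules standing in for trained trees, and the metric names `loc` and `complexity` are made up.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Row sampling: draw len(rows) records with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the most base learners."""
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(trees, sample):
    """Aggregate the votes of all trees for one sample."""
    return majority_vote([tree(sample) for tree in trees])

# Three hypothetical base learners, written as plain functions for brevity.
trees = [
    lambda s: "defective" if s["loc"] > 200 else "clean",
    lambda s: "defective" if s["complexity"] > 10 else "clean",
    lambda s: "clean",
]
```

In a real forest, each tree would be trained on its own `bootstrap_sample` of the training rows, so the trees disagree in useful ways and the vote averages out their individual errors.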

3. DEEP FOREST CLASSIFICATION
Deep Forest is a newer tree-based classification algorithm that improves on the Random Forest algorithm. It is regarded as an alternative to Deep Neural Networks (DNN), and its main component is a layer-by-layer structure called a cascade forest [29]. In a cascade forest, each tree produces a class distribution for each instance [30]. FIGURE 4 illustrates the layered nature of the algorithm, in which the layers are stacked one on top of the other. The initial layer takes the attributes or features of the original dataset as input, and its output is then handled by the Random Forests in the next layer. The cascade stops growing when an additional layer no longer improves the result or when the output at a given layer deteriorates. The Deep Forest algorithm averages the results from layer to layer up to the final layer. The downside is that Deep Forest takes longer to process than Random Forest [7].
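The stopping rule and the averaging step described above can be sketched as follows. This is a simplified illustration, not the Deep Forest implementation used in the study: the function names and the `patience` parameter are hypothetical, and the per-layer scores are assumed to be validation scores supplied by the caller.

```python
def layers_to_keep(layer_scores, patience=1):
    """Return how many cascade layers to keep, given the score after each layer.

    Growth stops once `patience` consecutive layers fail to improve on the
    best score seen so far.
    """
    best_score = float("-inf")
    best_depth = 0
    stalled = 0
    for depth, score in enumerate(layer_scores, start=1):
        if score > best_score:
            best_score, best_depth, stalled = score, depth, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return best_depth

def average_class_vectors(vectors):
    """Average the class-distribution vectors produced by the forests in a layer."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]
```

For example, with made-up scores `[0.70, 0.74, 0.73]` the third layer does not improve on the second, so only two layers would be kept. Training several forests per layer is also what makes Deep Forest slower than a single Random Forest.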

E. EVALUATION OF RESULT
The features in this study were taken from the 3 ReLink datasets obtained from the GitHub repository, each of which has 26 features. Feature selection is an important step in data analysis, because the right features improve the classification performance of the model. In this study, feature selection was performed using the Firefly Algorithm to improve the efficiency of feature selection and the accuracy of the tree-based classification models. Firefly is used to find the feature combination that gives the highest AUC performance for the classification model. A total of 10 trials were carried out to obtain the average AUC value. After applying Firefly, the final results compare the AUC of models that combine classification with Firefly feature selection against models that use classification with hyperparameter tuning, so that it can be seen whether Firefly feature selection increases AUC performance in software defect prediction.

The classification performance of the Decision Tree, Random Forest, and Deep Forest models on each ReLink dataset is evaluated using the AUC (Area Under the ROC Curve) value. The AUC represents the area under the ROC curve and has been recommended for improving cross-study comparability. Its potential for significantly enhancing convergence across empirical experiments in software defect prediction lies in its ability to separate predictive performance from operating conditions, so that it serves as a general measure of predictiveness [31].
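To make the evaluation measure concrete, the AUC can be computed directly from its ranking interpretation: the probability that a randomly chosen defective instance receives a higher score than a randomly chosen clean one. The sketch below is a plain-Python illustration with hypothetical function names, not the evaluation code used in the study; labels are assumed to be 1 for defective and 0 for clean.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (defective, clean) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one instance of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

def mean_auc(trial_aucs):
    """Average AUC over repeated trials."""
    return sum(trial_aucs) / len(trial_aucs)
```

Because the value depends only on how instances are ranked, not on a particular decision threshold, it can be compared across studies that use different operating conditions.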

III. RESULT
TABLE 2 shows the performance of the tree-based classification algorithms with Firefly feature selection on the Apache dataset. TABLE 3 shows the number of times each feature appears in 10 trials using Firefly feature selection on the Apache, Safe, and Zxing datasets.

IV. DISCUSSION
In this study, a total of 10 trials were carried out to obtain the average value. The area under the curve (AUC) values obtained from the ten experiments on the three ReLink datasets are presented in Tables 2, 4, and 6. Because Firefly feature selection is stochastic and picks features according to the best intensity, the selected features change with each trial, so the AUC values vary, with some higher and some lower. The AUC values obtained from each experiment differ, and the best results used 12 features on average. Feature selection by the Firefly algorithm on all tree-based classification algorithms improved software defect prediction performance compared to prior studies that employed hyperparameter tuning. This is evidenced by the average performance of each proposed method relative to previous methods, as shown in Table 10.
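The per-feature selection counts and the average number of selected features discussed above could be tallied as in the sketch below. The function names are hypothetical, and each trial is assumed to be represented as a list of booleans with one entry per feature.

```python
from collections import Counter

def selection_frequency(trial_masks, feature_names):
    """Count in how many trials each feature was selected."""
    counts = Counter()
    for mask in trial_masks:
        counts.update(name for name, keep in zip(feature_names, mask) if keep)
    return counts

def mean_selected(trial_masks):
    """Average number of selected features per trial."""
    return sum(sum(mask) for mask in trial_masks) / len(trial_masks)
```

Sorting the resulting counter with `most_common()` gives the features in order of how often they were chosen across trials.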

V. CONCLUSION
This study aims to predict software defects in the ReLink dataset by applying Decision Tree, Random Forest, and Deep Forest tree-based classification with Firefly feature selection. The performance of these models varies, as shown by the comparison results of the experimental trials. Specifically, Firefly feature selection was found to improve AUC performance compared to previous studies that used hyperparameter tuning for tree-based classification. In addition, Firefly feature selection combined with tree-based classification outperformed previous studies that used the Naïve Bayes (NB), Logistic Regression (LR), and Support Vector Machine (SVM) methods. Overall, these findings highlight the potential benefits of using Firefly feature selection with tree-based classification in predicting software defects.

The findings indicate that Firefly feature selection combined with Random Forest classification yields better performance than Firefly feature selection combined with the other tree-based classifiers. This is evidenced by an average AUC value of 0.77, an average of 12 out of 26 features used, and the most frequently selected features belonging to the Count Metric category. The results thus suggest that the features in the Count Metric category are the most effective.

In future studies, the tree-based algorithms with Firefly feature selection will be tested on other datasets that have a higher score ratio, with the goal of finding better algorithm performance in predicting software defects. Further research will also experiment with Firefly feature selection combined with other classification algorithms, with the aim of finding a combination that achieves better performance.

FIGURE 4. The architecture of the cascade forest [29]

TABLE 3, TABLE 4, and TABLE 5 describe how frequently each of the 26 features of the ReLink dataset was selected by the Firefly algorithm, ordered by how often each feature was used. Among these findings, features in the Count Metric (CTM) category emerge as the most frequently selected. TABLE 4 and FIGURE 5 show the average AUC values achieved across all ReLink datasets and the most frequently used feature sets.

FIGURE 5. Graph of AUC performance comparison with previous studies
With the Decision Tree parameters set to their default values and without Firefly feature selection, the average AUC value is 0.66. With the Random Forest parameters set to their default values and without Firefly feature selection, the average AUC value is 0.72. With the Deep Forest parameters set to their default values and without Firefly feature selection, the average AUC value is 0.73. In this study, Decision Tree with Firefly feature selection produced an average AUC value of 0.66, Random Forest with Firefly feature selection produced an average AUC value of 0.77, and Deep Forest with Firefly feature selection produced an average AUC value of 0.76.